mirror of
				https://github.com/python/cpython.git
				synced 2025-11-04 07:31:38 +00:00 
			
		
		
		
	
		
			
	
	
		
			508 lines
		
	
	
	
		
			19 KiB
		
	
	
	
		
			Text
		
	
	
	
	
	
		
		
			
		
	
	
Developer Notes for Python Compiler
===================================

Table of Contents
-----------------

- Scope
    Defines the limits of the change
- Parse Trees
    Describes the local (Python) concept
- Abstract Syntax Trees (AST)
    Describes the AST technology used
- Parse Tree to AST
    Defines the transform approach
- Control Flow Graphs
    Defines the creation of "basic blocks"
- AST to CFG to Bytecode
    Tracks the flow from AST to bytecode
- Code Objects
    Pointer to making bytecode "executable"
- Modified Files
    Files added/modified/removed from CPython compiler
- ToDo
    Work yet remaining (before complete)
- References
    Academic and technical references to technology used.


Scope
-----

Historically (through 2.4), compilation from source code to bytecode
involved two steps:

1. Parse the source code into a parse tree (Parser/pgen.c)
2. Emit bytecode based on the parse tree (Python/compile.c)

This is not how a standard compiler works.  The usual steps for
compilation are:

1. Parse source code into a parse tree (Parser/pgen.c)
2. Transform parse tree into an Abstract Syntax Tree (Python/ast.c)
3. Transform AST into a Control Flow Graph (Python/newcompile.c)
4. Emit bytecode based on the Control Flow Graph (Python/newcompile.c)

Starting with Python 2.5, the above steps are used.  This change was
made to simplify compilation by breaking the work that follows parsing
into three distinct steps.  The purpose of this document is to outline
how the latter three steps of the process work.

This document does not touch on how parsing works beyond what is needed
to explain compilation.  It is also not exhaustive in terms of how the
entire system works.  You will most likely need to read some source to
have an exact understanding of all details.


Parse Trees
-----------

Python's parser is an LL(1) parser mostly based on the implementation
laid out in the Dragon Book [Aho86]_.

The grammar file for Python can be found in Grammar/Grammar, with the
numeric values of grammar rules stored in Include/graminit.h.  The
numeric values for types of tokens (literal tokens, such as ``:``,
numbers, etc.) are kept in Include/token.h.  The parse tree is made up
of ``node *`` structs (as defined in Include/node.h).

Querying data from the node structs can be done with the following
macros (which are defined in Include/node.h unless noted otherwise):

- ``CHILD(node *, int)``
    Returns the nth child of the node using zero-offset indexing
- ``RCHILD(node *, int)``
    Returns the nth child of the node from the right side; the index
    must be negative (-1 is the last child)
- ``NCH(node *)``
    Number of children the node has
- ``STR(node *)``
    String representation of the node; e.g., will return ``:`` for a
    COLON token
- ``TYPE(node *)``
    The type of node as specified in ``Include/graminit.h``
- ``REQ(node *, TYPE)``
    Assert that the node is the type that is expected
- ``LINENO(node *)``
    Retrieve the line number of the source code that led to the
    creation of the parse rule; defined in Python/ast.c

To tie all of this together with an example, consider the rule for
'while'::

  while_stmt: 'while' test ':' suite ['else' ':' suite]

The node representing this will have ``TYPE(node) == while_stmt`` and
the number of children will be 4 or 7 depending on whether there is an
'else' clause.  To access what should be the first ':' and require that
it be an actual ':' token, use ``REQ(CHILD(node, 2), COLON)``.
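
The accessor macros described above can be tried out on a toy tree.  The
following is a minimal sketch, not the real CPython definitions: the
``toy_node`` struct, the ``TOY_``-prefixed macros, the numeric constants,
and the helpers ``make_while()`` and ``while_has_else()`` are all invented
here for illustration.  It only assumes that a node knows its type, its
string, and its children.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical grammar/token numbers (not the values CPython uses). */
enum { TOY_NAME = 1, TOY_COLON = 11,
       TOY_WHILE_STMT = 300, TOY_TEST = 301, TOY_SUITE = 302 };

/* Simplified stand-in for a parse tree node. */
typedef struct toy_node {
    int type;                   /* grammar rule or token number */
    const char *str;            /* token text, NULL for non-terminals */
    int nchildren;
    struct toy_node *children;
} toy_node;

#define TOY_NCH(n)        ((n)->nchildren)
#define TOY_CHILD(n, i)   (&(n)->children[i])
#define TOY_RCHILD(n, i)  (TOY_CHILD(n, TOY_NCH(n) + (i)))  /* i is negative */
#define TOY_TYPE(n)       ((n)->type)
#define TOY_STR(n)        ((n)->str)
#define TOY_REQ(n, t)     assert(TOY_TYPE(n) == (t))

/* Returns 1 if a while_stmt node carries an 'else' clause, 0 otherwise. */
static int while_has_else(const toy_node *n)
{
    TOY_REQ(n, TOY_WHILE_STMT);
    TOY_REQ(TOY_CHILD(n, 2), TOY_COLON);    /* the first ':' */
    return TOY_NCH(n) == 7;
}

/* Builds 'while' test ':' suite in caller-provided storage for 4 children. */
static toy_node make_while(toy_node kids[4])
{
    kids[0] = (toy_node){TOY_NAME, "while", 0, NULL};
    kids[1] = (toy_node){TOY_TEST, NULL, 0, NULL};
    kids[2] = (toy_node){TOY_COLON, ":", 0, NULL};
    kids[3] = (toy_node){TOY_SUITE, NULL, 0, NULL};
    return (toy_node){TOY_WHILE_STMT, NULL, 4, kids};
}
```

With a four-child node built by ``make_while()``, ``while_has_else()``
reports that no 'else' clause is present, and ``TOY_RCHILD(node, -1)``
reaches the suite.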


Abstract Syntax Trees (AST)
---------------------------

The abstract syntax tree (AST) is a high-level representation of the
program structure without the necessity of containing the source code;
it can be thought of as an abstract representation of the source code.
The AST nodes are specified using the Zephyr Abstract Syntax Definition
Language (ASDL) [Wang97]_.

The definition of the AST nodes for Python is found in the file
Parser/Python.asdl.

Each AST node (representing statements, expressions, and several
specialized types, like list comprehensions and exception handlers) is
defined by the ASDL.  Most definitions in the AST correspond to a
particular source construct, such as an 'if' statement or an attribute
lookup.  The definition is independent of its realization in any
particular programming language.

The following fragment of the Python ASDL construct demonstrates the
approach and syntax::

  module Python
  {
        stmt = FunctionDef(identifier name, arguments args, stmt* body,
                            expr* decorators)
              | Return(expr? value) | Yield(expr value)
              attributes (int lineno)
  }

The preceding example describes three different kinds of statements:
function definitions, return statements, and yield statements.  All
three kinds are considered of type stmt, as shown by the '|' separating
the various kinds.  They all take arguments of various kinds and amounts.

Modifiers on the argument type specify the number of values needed; '?'
means it is optional, '*' means zero or more, and no modifier means
exactly one value is required.  FunctionDef, for instance, takes an
identifier for the name, 'arguments' for args, zero or more stmt
arguments for 'body', and zero or more expr arguments for 'decorators'.

Do notice that something like 'arguments', which is a node type, is
represented as a single AST node and not as a sequence of nodes as with
stmt, as one might expect.

All three kinds also have an 'attributes' argument; this is shown by the
fact that 'attributes' lacks a '|' before it.

The statement definitions above generate the following C structure type::

  typedef struct _stmt *stmt_ty;

  struct _stmt {
        enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind;
        union {
                struct {
                        identifier name;
                        arguments_ty args;
                        asdl_seq *body;
                        asdl_seq *decorators;
                } FunctionDef;

                struct {
                        expr_ty value;
                } Return;

                struct {
                        expr_ty value;
                } Yield;
        } v;
        int lineno;
  };

Also generated are a series of constructor functions that allocate (in
this case) a stmt_ty struct with the appropriate initialization.  The
'kind' field specifies which component of the union is initialized.  The
FunctionDef() constructor function sets 'kind' to FunctionDef_kind and
initializes the 'name', 'args', 'body', and 'decorators' fields along
with the 'lineno' attribute.
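
To make the constructor idea concrete, here is a minimal sketch of what
such a function might look like for the Return kind.  The ``toy_``-prefixed
types and the ``ToyReturn()`` and ``stmt_value()`` functions are
hypothetical stand-ins written for this example; the generated CPython
code has more kinds and fields and may allocate differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for an expression node. */
struct toy_expr { int dummy; };
typedef struct toy_expr *toy_expr_ty;

/* Tagged union: 'kind' says which member of 'v' is valid. */
typedef struct toy_stmt *toy_stmt_ty;
struct toy_stmt {
    enum { ToyReturn_kind = 1, ToyYield_kind = 2 } kind;
    union {
        struct { toy_expr_ty value; } Return;   /* may be NULL (expr?) */
        struct { toy_expr_ty value; } Yield;
    } v;
    int lineno;
};

/* Constructor in the style the text describes: allocate, tag, fill in. */
static toy_stmt_ty ToyReturn(toy_expr_ty value, int lineno)
{
    toy_stmt_ty s = malloc(sizeof(*s));
    if (s == NULL)
        return NULL;
    s->kind = ToyReturn_kind;
    s->v.Return.value = value;
    s->lineno = lineno;
    return s;
}

/* Consumers branch on 'kind' before touching the union. */
static toy_expr_ty stmt_value(toy_stmt_ty s)
{
    assert(s != NULL);
    switch (s->kind) {
    case ToyReturn_kind: return s->v.Return.value;
    case ToyYield_kind:  return s->v.Yield.value;
    }
    return NULL;
}
```

The point of the pattern is that only the constructor writes 'kind', so
code that later reads the union can rely on the tag being consistent
with the member that was initialized.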

*** NOTE: if you make a change here that can affect the output of bytecode that
is already in existence, make sure to delete your old .py(c|o) files!  Running
``find . -name '*.py[co]' -exec rm -f {} ';'`` should do the trick.


Parse Tree to AST
-----------------

The AST is generated from the parse tree (see Python/ast.c) using the
function::

  mod_ty PyAST_FromNode(const node *n);

The function begins a tree walk of the parse tree, creating various AST
nodes as it goes along.  It does this by allocating all the new nodes it
needs, calling the proper AST node creation functions along with any
required supporting functions, and connecting the nodes as needed.

Do realize that there is no automated or symbolic connection between
the grammar specification and the nodes in the parse tree.  No help is
directly provided by the parse tree as in yacc.

For instance, one must keep track of which node in the parse tree one is
working with (e.g., if you are working with an 'if' statement you need
to watch out for the ':' token to find the end of the conditional).

The functions called to generate AST nodes from the parse tree all have
the name ast_for_xx, where xx is the grammar rule that the function
handles (alias_for_import_name is the exception to this).  These in turn
call the constructor functions as defined by the ASDL grammar and
contained in Python/Python-ast.c (which was generated by
Parser/asdl_c.py) to create the nodes of the AST.  This all leads to a
sequence of AST nodes stored in asdl_seq structs.


Functions and macros for creating and using ``asdl_seq *`` types, as
found in Python/asdl.c and Include/asdl.h:

- ``asdl_seq_new(int)``
    Allocate memory for an asdl_seq of the given length
- ``asdl_seq_free(asdl_seq *)``
    Free asdl_seq struct
- ``asdl_seq_GET(asdl_seq *seq, int pos)``
    Get item held at 'pos'
- ``asdl_seq_SET(asdl_seq *seq, int pos, void *val)``
    Set 'pos' in 'seq' to 'val'
- ``asdl_seq_APPEND(asdl_seq *seq, void *val)``
    Set the end of 'seq' to 'val'
- ``asdl_seq_LEN(asdl_seq *)``
    Return the length of 'seq'
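
The operations listed above amount to a fixed-length array of ``void *``
slots.  The sketch below shows one way such a sequence could be built; it
is an illustration only.  The ``toy_seq`` type, its field names, and the
``toy_seq_*`` / ``TOY_SEQ_*`` helpers are assumptions made for this
example and do not reproduce the layout or behavior of the real asdl_seq.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical fixed-capacity sequence of untyped pointers. */
typedef struct {
    int size;         /* capacity requested at creation */
    int used;         /* slots filled so far by toy_seq_append() */
    void **elements;
} toy_seq;

/* Allocates a sequence with 'size' empty (NULL) slots. */
static toy_seq *toy_seq_new(int size)
{
    toy_seq *seq = malloc(sizeof(*seq));
    if (seq == NULL)
        return NULL;
    seq->elements = calloc(size > 0 ? (size_t)size : 1, sizeof(void *));
    if (seq->elements == NULL) {
        free(seq);
        return NULL;
    }
    seq->size = size;
    seq->used = 0;
    return seq;
}

/* Releases the sequence itself; the stored items are not freed. */
static void toy_seq_free(toy_seq *seq)
{
    if (seq != NULL) {
        free(seq->elements);
        free(seq);
    }
}

#define TOY_SEQ_GET(S, I)     ((S)->elements[(I)])
#define TOY_SEQ_SET(S, I, V)  ((S)->elements[(I)] = (V))
#define TOY_SEQ_LEN(S)        ((S)->size)

/* Stores 'val' in the next unused slot; returns 0 on success, -1 if full. */
static int toy_seq_append(toy_seq *seq, void *val)
{
    assert(seq != NULL);
    if (seq->used >= seq->size)
        return -1;
    seq->elements[seq->used++] = val;
    return 0;
}
```

Because the slots hold ``void *``, the caller is responsible for knowing
which node type a given sequence contains.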

If you are working with statements, you must also worry about keeping
track of what line number generated the statement.  Currently the line
number is passed as the last parameter to each stmt_ty function.


Control Flow Graphs
-------------------

A control flow graph (often referenced by its acronym, CFG) is a
directed graph that models the flow of a program using basic blocks that
contain the intermediate representation (abbreviated "IR", which in this
case is Python bytecode).  A basic block is a block of IR that has a
single entry point but possibly multiple exit points.  The single entry
point is the key to basic blocks; it all has to do with jumps.  An entry
point is the target of something that changes control flow (such as a
function call or a jump), while exit points are instructions that would
change the flow of the program (such as jumps and 'return' statements).
What this means is that a basic block is a chunk of code that starts at
the entry point and runs to an exit point or the end of the block.

As an example, consider an 'if' statement with an 'else' block.  The
guard on the 'if' is a basic block which is pointed to by the basic
block containing the code leading to the 'if' statement.  The 'if'
statement block contains jumps (which are exit points) to the true body
of the 'if' and the 'else' body (which may be NULL), each of which is
its own basic block.  Both of those blocks in turn point to the
basic block representing the code following the entire 'if' statement.

CFGs are usually one step away from final code output.  Code is directly
generated from the basic blocks (with jump targets adjusted based on the
output order) by doing a post-order depth-first search on the CFG
following the edges.
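
The 'if'/'else' example and the post-order walk can be sketched in a few
lines of C.  This is a toy model under stated assumptions: the
``toy_block`` struct, its two-successor limit, and the functions
``toy_postorder()`` and ``toy_if_else_order()`` are invented here and are
not the data structures used by the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical basic block: a label plus up to two outgoing edges. */
typedef struct toy_block {
    const char *label;
    struct toy_block *succ[2];   /* NULL when the edge is unused */
    int seen;                    /* visited marker for the walk */
} toy_block;

/* Post-order depth-first walk: a block is recorded after its successors.
   Returns the number of blocks written to 'out' so far. */
static int toy_postorder(toy_block *b, toy_block **out, int n)
{
    if (b == NULL || b->seen)
        return n;
    b->seen = 1;
    for (int i = 0; i < 2; i++)
        n = toy_postorder(b->succ[i], out, n);
    out[n++] = b;
    return n;
}

/* Wires up the 'if'/'else' diamond from the text and walks it from the
   guard.  'blocks' and 'out' must each have room for four entries. */
static int toy_if_else_order(toy_block blocks[4], toy_block **out)
{
    enum { GUARD, BODY, ORELSE, AFTER };
    blocks[GUARD]  = (toy_block){"guard",  {&blocks[BODY], &blocks[ORELSE]}, 0};
    blocks[BODY]   = (toy_block){"body",   {&blocks[AFTER], NULL}, 0};
    blocks[ORELSE] = (toy_block){"orelse", {&blocks[AFTER], NULL}, 0};
    blocks[AFTER]  = (toy_block){"after",  {NULL, NULL}, 0};
    return toy_postorder(&blocks[GUARD], out, 0);
}
```

In this sketch the walk records the blocks in the order after, body,
orelse, guard.  Read back to front, that places the guard before both
bodies and both bodies before the block that follows the 'if' statement.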


AST to CFG to Bytecode
----------------------

With the AST created, the next step is to create the CFG.  The first step
is to convert the AST to Python bytecode without having jump targets
resolved to specific offsets (this is calculated when the CFG goes to
final bytecode).  Essentially, this transforms the AST into Python
bytecode with control flow represented by the edges of the CFG.

Conversion is done in two passes.  The first creates the namespace
(variables can be classified as local, free/cell for closures, or
global).  With that done, the second pass essentially flattens the CFG
into a list and calculates jump offsets for final output of bytecode.

The conversion process is initiated by a call to the following function
in Python/newcompile.c::

  PyCodeObject * PyAST_Compile(mod_ty, const char *, PyCompilerFlags);

This function does both the conversion of the AST to a CFG and
outputting final bytecode from the CFG.  The AST to CFG step is handled
mostly by the two functions called by PyAST_Compile()::

  struct symtable * PySymtable_Build(mod_ty, const char *,
                                     PyFutureFeatures);
  PyCodeObject * compiler_mod(struct compiler *, mod_ty);

The former is in Python/symtable.c while the latter is in
Python/newcompile.c.

PySymtable_Build() begins by entering the starting code block for the
AST (passed-in) and then calling the proper symtable_visit_xx function
(with xx being the AST node type).  Next, the AST is walked, with the
various code blocks that delineate the reach of a local variable being
entered and exited using::

  static int symtable_enter_block(struct symtable *, identifier,
                                  block_ty, void *, int);
  static int symtable_exit_block(struct symtable *, void *);

Once the symbol table is created, it is time for CFG creation, whose
code is in Python/newcompile.c.  This is handled by several functions
that break the task down by various AST node types.  The functions are
all named compiler_visit_xx where xx is the name of the node type (such
as stmt, expr, etc.).  Each function receives a ``struct compiler *``
and xx_ty where xx is the AST node type.  Typically these functions
consist of a large 'switch' statement, branching based on the kind of
node type passed to them.  Simple things are handled inline in the
'switch' statement, with more complex transformations farmed out to
other functions named compiler_xx with xx being a descriptive name of
what is being handled.

When transforming an arbitrary AST node, use the VISIT macro::

  VISIT(struct compiler *, <node type>, <AST node>);

The appropriate compiler_visit_xx function is called, based on the value
passed in for <node type> (so ``VISIT(c, expr, node)`` calls
``compiler_visit_expr(c, node)``).  The VISIT_SEQ macro is very similar,
but is called on AST node sequences (those values that were created as
arguments to a node that used the '*' modifier).  There is also
VISIT_SLICE just for handling slices::

  VISIT_SLICE(struct compiler *, slice_ty, expr_context_ty);
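
Dispatching on the node type name, as VISIT is described as doing, can be
achieved with preprocessor token pasting.  The following is a minimal
sketch of that technique, assuming a design in which a failed visit makes
the enclosing function return 0.  The ``toy_`` types, the ``TOY_VISIT``
macro, and ``toy_visit_pair()`` are hypothetical names for this example,
not the definitions in the compiler source.

```c
#include <assert.h>

/* Hypothetical compiler state and node types for this sketch. */
struct toy_compiler { int exprs_seen; int stmts_seen; };
struct toy_expr { int kind; };
struct toy_stmt { int kind; };
typedef struct toy_expr *toy_expr_ty;
typedef struct toy_stmt *toy_stmt_ty;

/* One visitor per node type; each returns nonzero on success. */
static int toy_visit_expr(struct toy_compiler *c, toy_expr_ty e)
{
    (void)e;
    c->exprs_seen++;
    return 1;
}

static int toy_visit_stmt(struct toy_compiler *c, toy_stmt_ty s)
{
    (void)s;
    c->stmts_seen++;
    return 1;
}

/* '##' pastes TYPE onto the function name, so TOY_VISIT(c, expr, e)
   expands to a call of toy_visit_expr(c, e).  On failure the function
   that used the macro returns 0 immediately. */
#define TOY_VISIT(C, TYPE, V) \
    do { \
        if (!toy_visit_ ## TYPE((C), (V))) \
            return 0; \
    } while (0)

/* Visits one statement and then one expression, stopping on failure. */
static int toy_visit_pair(struct toy_compiler *c, toy_stmt_ty s,
                          toy_expr_ty e)
{
    TOY_VISIT(c, stmt, s);
    TOY_VISIT(c, expr, e);
    return 1;
}
```

A macro of this shape keeps error propagation out of the way, at the
cost of hiding a ``return`` inside the macro body.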
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Emission of bytecode is handled by the following macros:
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								- ``ADDOP(struct compiler *c, int op)``
							 | 
						||
| 
								 | 
							
								    add 'op' as an opcode
							 | 
						||
| 
								 | 
							
								- ``ADDOP_I(struct compiler *c, int op, int oparg)``
							 | 
						||
| 
								 | 
							
								    add 'op' with an 'oparg' argument
							 | 
						||
| 
								 | 
							
								- ``ADDOP_O(struct compiler *c, int op, PyObject *type, PyObject *obj)``
							 | 
						||
| 
								 | 
							
								    add 'op' with the proper argument based on the position of obj in
							 | 
						||
| 
								 | 
							
								    'type', but with no handling of mangled names; used for when you
							 | 
						||
| 
								 | 
							
								    need to do named lookups of objects such as globals, consts, or
							 | 
						||
| 
								 | 
							
								    parameters where name mangling is not possible and the scope of the
							 | 
						||
| 
								 | 
							
								    name is known
							 | 
						||
| 
								 | 
							
								- ``ADDOP_NAME(struct compiler *, int, PyObject *, PyObject *)``
							 | 
						||
| 
								 | 
							
								    just like ADDOP_O, but name mangling is also handled; used for
							 | 
						||
| 
								 | 
							
								    attribute loading or importing based on name
							 | 
						||
| 
								 | 
							
								- ``ADDOP_JABS(struct compiling *c, int op, basicblock b)``
							 | 
						||
| 
								 | 
							
								    create an absolute jump to the basic block 'b'
							 | 
						||
| 
								 | 
							
								- ``ADDOP_JREL(struct compiling *c, int op, basicblock b)``
							 | 
						||
| 
								 | 
							
								    create a relative jump to the basic block 'b'
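
To make the idea behind the emission macros more concrete, here is a minimal
sketch of appending instructions to a growable buffer, with wrapper macros
that return from the caller on failure.  Everything in it (toy_instr,
toy_block, toy_addop, toy_addop_i, the TOY_ macros, and the opcode numbers)
is hypothetical and only illustrates the pattern; the real structures and
macros in the compiler may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Made-up opcode numbers for illustration only. */
enum { TOY_LOAD = 1, TOY_RETURN = 2 };

/* Hypothetical instruction record. */
typedef struct {
    int opcode;
    int has_arg;
    int oparg;
} toy_instr;

/* Hypothetical growable instruction buffer standing in for a basic block. */
typedef struct {
    toy_instr *instrs;
    size_t used;
    size_t allocated;
} toy_block;

/* Return the index of a fresh instruction slot, or -1 if allocation fails. */
static int
toy_next_instr(toy_block *b)
{
    assert(b != NULL);
    if (b->used == b->allocated) {
        size_t new_size = b->allocated ? b->allocated * 2 : 16;
        toy_instr *grown = realloc(b->instrs, new_size * sizeof(toy_instr));
        if (grown == NULL)
            return -1;
        b->instrs = grown;
        b->allocated = new_size;
    }
    return (int)b->used++;
}

/* Append an opcode without an argument; 1 on success, 0 on failure. */
static int
toy_addop(toy_block *b, int opcode)
{
    int i = toy_next_instr(b);
    if (i < 0)
        return 0;
    b->instrs[i] = (toy_instr){opcode, 0, 0};
    return 1;
}

/* Append an opcode with an integer argument; 1 on success, 0 on failure. */
static int
toy_addop_i(toy_block *b, int opcode, int oparg)
{
    int i = toy_next_instr(b);
    if (i < 0)
        return 0;
    b->instrs[i] = (toy_instr){opcode, 1, oparg};
    return 1;
}

/* Wrappers that make the calling function return 0 when emission fails. */
#define TOY_ADDOP(B, OP) { if (!toy_addop((B), (OP))) return 0; }
#define TOY_ADDOP_I(B, OP, ARG) { if (!toy_addop_i((B), (OP), (ARG))) return 0; }

/* Emit "load with argument 7, then return" into the given block. */
static int
toy_emit_example(toy_block *b)
{
    TOY_ADDOP_I(b, TOY_LOAD, 7);
    TOY_ADDOP(b, TOY_RETURN);
    return 1;
}
```

Hiding the error check inside a macro keeps functions that emit many
instructions short, at the cost of a hidden return statement in the caller.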

Several helper functions emit bytecode and are named compiler_xx(),
where xx is what the function helps with (list, boolop, etc.).  A
rather useful one is::

  static int compiler_nameop(struct compiler *, identifier,
                             expr_context_ty);

This function looks up the scope of a variable and, based on the
expression context, emits the proper opcode to load, store, or delete
the variable.
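
The selection described above can be pictured as a lookup keyed on scope and
expression context.  The sketch below is only an illustration under
simplifying assumptions: the toy_scope and toy_context enums, the function
toy_nameop, and the opcode numbers are all hypothetical, and only three
scopes are modeled.

```c
#include <assert.h>

/* Hypothetical scopes and contexts; only three scopes are modeled here. */
typedef enum { SCOPE_LOCAL, SCOPE_GLOBAL, SCOPE_NAME } toy_scope;
typedef enum { CTX_LOAD, CTX_STORE, CTX_DEL } toy_context;

/* Made-up opcode numbers used only for this illustration. */
enum {
    OP_LOAD_FAST = 1, OP_STORE_FAST, OP_DELETE_FAST,
    OP_LOAD_GLOBAL, OP_STORE_GLOBAL, OP_DELETE_GLOBAL,
    OP_LOAD_NAME, OP_STORE_NAME, OP_DELETE_NAME
};

/* Pick an opcode from the variable's scope and the expression context. */
static int
toy_nameop(toy_scope scope, toy_context ctx)
{
    static const int table[3][3] = {
        /* CTX_LOAD        CTX_STORE        CTX_DEL */
        { OP_LOAD_FAST,   OP_STORE_FAST,   OP_DELETE_FAST },   /* local */
        { OP_LOAD_GLOBAL, OP_STORE_GLOBAL, OP_DELETE_GLOBAL }, /* global */
        { OP_LOAD_NAME,   OP_STORE_NAME,   OP_DELETE_NAME },   /* name */
    };
    assert((int)scope >= 0 && (int)scope <= 2);
    assert((int)ctx >= 0 && (int)ctx <= 2);
    return table[scope][ctx];
}
```

A table keeps the scope and context decisions in one place; a real
implementation would also have to emit the chosen opcode with the right
argument, which this sketch leaves out.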

The line number on which a statement is defined is handled by
compiler_visit_stmt(), so you do not need to worry about it.

In addition to emitting bytecode based on the AST node, the creation of
basic blocks must also be handled.  Below are the macros and functions
used for managing basic blocks:

- ``NEW_BLOCK(struct compiler *)``
    create block and set it as current
- ``NEXT_BLOCK(struct compiler *)``
    basically NEW_BLOCK() plus jump from current block
- ``compiler_new_block(struct compiler *)``
    create a block but don't use it (used for generating jumps)
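
The difference between the three operations can be sketched with a toy
block list.  All names below (toy_block, toy_compiler, toy_new_block,
toy_use_new_block, toy_use_next_block, toy_free_blocks) are hypothetical,
and modeling the "jump from current block" as a simple successor link is an
assumption made to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical basic block: only the links needed for this sketch. */
typedef struct toy_block {
    struct toy_block *list;   /* all blocks allocated so far, newest first */
    struct toy_block *next;   /* successor reached from this block, if any */
} toy_block;

/* Hypothetical compiler state. */
typedef struct {
    toy_block *blocks;        /* head of the allocation list */
    toy_block *current;       /* block currently receiving instructions */
} toy_compiler;

/* Create a block and record it, but leave the current block unchanged. */
static toy_block *
toy_new_block(toy_compiler *c)
{
    assert(c != NULL);
    toy_block *b = calloc(1, sizeof(toy_block));
    if (b == NULL)
        return NULL;
    b->list = c->blocks;
    c->blocks = b;
    return b;
}

/* Create a block and make it current, with no link from the old block. */
static toy_block *
toy_use_new_block(toy_compiler *c)
{
    toy_block *b = toy_new_block(c);
    if (b != NULL)
        c->current = b;
    return b;
}

/* Create a block, link it as the current block's successor, and switch
   to it. */
static toy_block *
toy_use_next_block(toy_compiler *c)
{
    toy_block *b = toy_new_block(c);
    if (b == NULL)
        return NULL;
    if (c->current != NULL)
        c->current->next = b;
    c->current = b;
    return b;
}

/* Free every block by walking the allocation list. */
static void
toy_free_blocks(toy_compiler *c)
{
    while (c->blocks != NULL) {
        toy_block *b = c->blocks;
        c->blocks = b->list;
        free(b);
    }
    c->current = NULL;
}
```

Keeping every block on one allocation list means cleanup does not depend on
the shape of the control flow graph, which may contain cycles.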

Once the CFG is created, it must be flattened and then final emission of
bytecode occurs.  Flattening is handled using a post-order depth-first
search.  Once flattened, jump offsets are backpatched based on the
flattening and then a PyCodeObject is created.  All of this is
handled by calling::

  PyCodeObject * assemble(struct compiler *, int);
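
The flattening and backpatching steps can be sketched on a toy graph.  The
code below is an illustration under stated assumptions, not the actual
assembler: toy_block, toy_assembler, toy_dfs, toy_assign_offsets, and
toy_jump_arg are hypothetical names, each block has at most one successor
and one jump target, the graph holds at most MAX_BLOCKS blocks, and blocks
are laid out in reverse post-order.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 16

/* Hypothetical basic block for this sketch. */
typedef struct toy_block {
    int size;                  /* bytes of bytecode in this block */
    int offset;                /* filled in once the layout is known */
    int seen;                  /* depth-first search marker */
    struct toy_block *next;    /* successor block, or NULL */
    struct toy_block *jump;    /* jump target, or NULL */
} toy_block;

typedef struct {
    toy_block *postorder[MAX_BLOCKS];
    int nblocks;
} toy_assembler;

/* Post-order depth-first search: a block is recorded only after the
   blocks reachable from it have been recorded. */
static void
toy_dfs(toy_assembler *a, toy_block *b)
{
    if (b == NULL || b->seen)
        return;
    b->seen = 1;
    toy_dfs(a, b->next);
    toy_dfs(a, b->jump);
    assert(a->nblocks < MAX_BLOCKS);
    a->postorder[a->nblocks++] = b;
}

/* Lay blocks out in reverse post-order and record each block's offset.
   Returns the total size of the flattened bytecode. */
static int
toy_assign_offsets(toy_assembler *a)
{
    int total = 0;
    for (int i = a->nblocks - 1; i >= 0; i--) {
        a->postorder[i]->offset = total;
        total += a->postorder[i]->size;
    }
    return total;
}

/* Backpatch an absolute jump: its argument is the target's final offset. */
static int
toy_jump_arg(const toy_block *b)
{
    assert(b != NULL && b->jump != NULL);
    return b->jump->offset;
}
```

Jump arguments can only be filled in after every block has an offset, which
is why the layout pass has to run before backpatching.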

*** NOTE: if you make a change here that can affect the output of bytecode that
is already in existence, make sure to delete your old .py(c|o) files!  Running
``find . -name '*.py[co]' -exec rm -f {} ';'`` should do the trick.


Code Objects
------------

In the end, one ends up with a PyCodeObject which is defined in
Include/code.h .  And with that you now have executable Python bytecode!


Modified Files
--------------

+ Parser/

    - Python.asdl
        ASDL syntax file

    - asdl.py
        "An implementation of the Zephyr Abstract Syntax Definition
        Language."  Uses SPARK_ to parse the ASDL files.

    - asdl_c.py
        "Generate C code from an ASDL description."  Generates
        ../Python/Python-ast.c and ../Include/Python-ast.h .

    - spark.py
        SPARK_ parser generator

+ Python/

    - Python-ast.c
        Creates C structs corresponding to the ASDL types.  Also
        contains code for marshaling AST nodes (core ASDL types have
        marshaling code in asdl.c).  "File automatically generated by
        ../Parser/asdl_c.py".

    - asdl.c
        Contains code to handle the ASDL sequence type.  Also has code
        to handle marshaling the core ASDL types, such as number and
        identifier.  Used by Python-ast.c for marshaling AST nodes.

    - ast.c
        Converts Python's parse tree into the abstract syntax tree.

    - compile.txt
        This file.

    - newcompile.c
        New version of compile.c that handles the emitting of bytecode.

    - symtable.c
        Generates symbol table from AST.


+ Include/

    - Python-ast.h
        Contains the actual definitions of the C structs as generated by
        ../Python/Python-ast.c .
        "Automatically generated by ../Parser/asdl_c.py".

    - asdl.h
        Header for the corresponding ../Python/asdl.c .

    - ast.h
        Declares PyAST_FromNode() external (from ../Python/ast.c).

    - code.h
        Header file for ../Objects/codeobject.c; contains definition of
        PyCodeObject.

    - symtable.h
        Header for ../Python/symtable.c .  struct symtable and
        PySTEntryObject are defined here.

+ Objects/

    - codeobject.c
        Contains PyCodeObject-related code (originally in
        ../Python/compile.c).


ToDo
----

*** NOTE: all bugs and patches should be filed on SF under the group
          "AST" for easy searching.  It also does not hurt to put
          "[AST]" at the beginning of the subject line of the tracker
          item.

+ Stdlib support
    - AST->Python access?
    - rewrite compiler package to mirror AST structure?
+ Documentation
    - flesh out this doc
        * byte stream output
        * explanation of how the symbol table pass works
        * code object (PyCodeObject)
+ Universal
    - make sure entire test suite passes
    - fix memory leaks
    - make sure return types are properly checked for errors
    - no gcc warnings

References
----------

.. [Aho86] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman.
   `Compilers: Principles, Techniques, and Tools`,
   http://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-0162389-6419108

.. [Wang97]  Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Chris
   S. Serra.  `The Zephyr Abstract Syntax Description Language.`_
   In Proceedings of the Conference on Domain-Specific Languages, pp.
   213--227, 1997.

.. _The Zephyr Abstract Syntax Description Language.:
   http://www.cs.princeton.edu/~danwang/Papers/dsl97/dsl97.html

.. _SPARK: http://pages.cpsc.ucalgary.ca/~aycock/spark/