386 commits

Author SHA1 Message Date
Marc-André Lemburg
80d1dd5f3b Fix for bug #444493: u'\U00010001' segfaults with current CVS on
wide builds.
2001-07-25 16:05:59 +00:00
Marc-André Lemburg
6c6bfb7c70 Make the unicode-escape and the UTF-16 codecs handle surrogates
correctly and thus roundtrip-safe.

Some minor cleanups of the code.

Added tests for the roundtrip-safety.
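
The roundtrip property this message refers to can be illustrated with a short sketch. It assumes a modern Python 3 interpreter rather than the 2001 CVS tree the commit targeted, and the helper name `roundtrips` is hypothetical, not part of the test suite.

```python
def roundtrips(text, encoding):
    """Return True if encoding and then decoding gives back the same text."""
    return text.encode(encoding).decode(encoding) == text

# U+10001 lies outside the Basic Multilingual Plane, so UTF-16 has to
# represent it as the surrogate pair 0xD800, 0xDC01.
astral = "\U00010001"
utf16_bytes = astral.encode("utf-16-be")
escaped = astral.encode("unicode-escape")

# Check several codecs at once (the list of names is chosen for illustration).
results = {name: roundtrips(astral, name)
           for name in ("utf-16", "utf-16-le", "utf-16-be",
                        "unicode-escape", "utf-8")}
```

A codec is roundtrip-safe for a given string when the corresponding entry in `results` is true.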
2001-07-20 17:39:11 +00:00
Martin v. Löwis
ce9b5a55e1 Encode surrogates in UTF-8 even for a wide Py_UNICODE.
Implement sys.maxunicode.
Explicitly wrap around upper/lower computations for wide Py_UNICODE.
When decoding large characters with UTF-8, represent expected test
results using the \U notation.
2001-06-27 06:28:56 +00:00
Tim Peters
2f228e75e4 Get rid of the superstitious "~" in dict hashing's "i = (~hash) & mask".
The comment following used to say:
	/* We use ~hash instead of hash, as degenerate hash functions, such
	   as for ints <sigh>, can have lots of leading zeros. It's not
	   really a performance risk, but better safe than sorry.
	   12-Dec-00 tim:  so ~hash produces lots of leading ones instead --
	   what's the gain? */
That is, there was never a good reason for doing it.  And to the contrary,
as explained on Python-Dev last December, it tended to make the *sum*
(i + incr) & mask (which is the first table index examined in case of
collision) the same "too often" across distinct hashes.

Changing to the simpler "i = hash & mask" reduced the number of string-dict
collisions (== number of times we go around the lookup for-loop) from about
6 million to 5 million during a full run of the test suite (these are
approximate because the test suite does some random stuff from run to run).
The number of collisions in non-string dicts also decreased, but not as
dramatically.

Note that this may, for a given dict, change the order (wrt previous
releases) of entries exposed by .keys(), .values() and .items().  A number
of std tests suffered bogus failures as a result.  For dicts keyed by
small ints, or (less so) by characters, the order is much more likely to be
in increasing order of key now; e.g.,

>>> d = {}
>>> for i in range(10):
...    d[i] = i
...
>>> d
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}
>>>

Unfortunately, people may latch on to that in small examples and draw a
bogus conclusion.
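
Why small-int keys tend to come out in increasing order can be seen from a minimal sketch of the first-probe computation alone. It assumes CPython, where hash(i) == i for small non-negative ints, and a table of 16 slots (mask 15); the function name `first_slot` is hypothetical, and collision probing and table resizing are deliberately left out.

```python
def first_slot(h, mask, invert=False):
    """First table index examined for hash h; invert=True mimics the old (~hash) & mask."""
    return (~h if invert else h) & mask

mask = 15  # assumed table size of 16 slots

# New scheme: i = hash & mask
plain = [first_slot(hash(i), mask) for i in range(10)]

# Old scheme: i = (~hash) & mask
inverted = [first_slot(hash(i), mask, invert=True) for i in range(10)]
```

With the plain mask, key i starts in slot i, so a walk over the table meets the keys in increasing order; with the inverted hash the same keys start in slots 15 down to 6.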

test_support.py
    Moved test_extcall's sortdict() into test_support, made it stronger,
    and imported sortdict into other std tests that needed it.
test_unicode.py
    Excluded cp875 from the "roundtrip over range(128)" test, because
    cp875 doesn't have a well-defined inverse for unicode("?", "cp875").
    See Python-Dev for excruciating details.
Cookie.py
    Changed various output functions to sort dicts before building
    strings from them.
test_extcall
    Fiddled the expected-result file.  This remains sensitive to native
    dict ordering, because, e.g., if there are multiple errors in a
    keyword-arg dict (and test_extcall sets up many cases like that), the
    specific error Python complains about first depends on native dict
    ordering.
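
A helper in the spirit of the sortdict() mentioned above might look like the following. This is a sketch written for Python 3, not the code that was checked into test_support; it only shows the idea of rendering a dict in sorted key order so that expected test output does not depend on native dict ordering.

```python
def sortdict(mapping):
    """Like repr(mapping), but with the items in sorted key order."""
    pairs = ["%r: %r" % item for item in sorted(mapping.items())]
    return "{%s}" % ", ".join(pairs)

shown = sortdict({"b": 2, "a": 1, "c": 3})
```

Comparing `sortdict(d)` against an expected string keeps a test stable even when the iteration order of `d` changes between releases.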
2001-05-13 00:19:31 +00:00
Marc-André Lemburg
542fe56cb9 Fix for bug #417030: "print '%*s' fails for unicode string" 2001-05-02 14:21:53 +00:00
Marc-André Lemburg
ef0a032883 Patch by Finn Bock to make test_unicode.py work for Jython. 2001-02-10 14:09:31 +00:00
Marc-André Lemburg
fde66e1bcc Fixed .capitalize() method of Unicode objects to work like the
corresponding string method. Added tests for this too.

Patch written by Marc-Andre Lemburg. Copyright assigned to Guido van Rossum.
2001-01-29 11:14:16 +00:00
Guido van Rossum
a1374e429b Change verify() function to raise TestFailed, not AssertionError.
(I realize that I didn't really test this, because all the tests
succeed, so verify() never raised an AssertionError -- but the test
suite still succeeds, so I'm not too worried.)
2001-01-19 19:01:56 +00:00
Tim Peters
d2bf3b7ca6 Whitespace normalization. Leaving tokenize_tests.py alone for now. 2001-01-18 02:22:22 +00:00
Marc-André Lemburg
3661908a6a This patch removes all uses of "assert" in the regression test suite
and replaces them with a new API verify(). As a result the regression
suite will also perform its tests in optimization mode.

Written by Marc-Andre Lemburg. Copyright assigned to Guido van Rossum.
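
The reason a function works where "assert" does not is that assert statements are compiled away under -O, while an ordinary call is always executed. A minimal sketch of such a helper follows; the exception class is a local stand-in, and the default message is an assumption rather than the suite's actual wording.

```python
class TestFailed(Exception):
    """Local stand-in for the test suite's failure exception."""

def verify(condition, reason="test failed"):
    """Raise TestFailed if condition is false; unlike assert, this also runs under -O."""
    if not condition:
        raise TestFailed(reason)

verify(1 + 1 == 2)  # passes silently

try:
    verify(len("abc") == 4, "unexpected length")
except TestFailed as exc:
    message = str(exc)
```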
2001-01-17 19:11:13 +00:00
Marc-André Lemburg
3a645e4dd4 Added checks to prevent PyUnicode_Count() from dumping core
in case the parameters are out of bounds and fixes error handling
for .count(), .startswith() and .endswith() for the case of
mixed string/Unicode objects.

This patch adds Python style index semantics to PyUnicode_Count()
indices (including the special handling of negative indices).

The patch is an extended version of patch #103249 submitted
by Michael Hudson (mwh) on SF. It also includes new test cases.
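
At the Python level, the index semantics described above correspond to the behaviour of the .count() method. The sketch below uses a Python 3 str and is only an illustration of slice-style indices, not a test taken from the patch.

```python
text = "abcabc"

last_three = text.count("b", -3)      # negative start: only the trailing "abc" is searched
first_three = text.count("a", 0, -3)  # negative end: only the leading "abc" is searched
past_the_end = text.count("c", 10)    # out-of-range start is clamped instead of failing
```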
2001-01-16 11:54:12 +00:00
Marc-André Lemburg
a866df806d This patch changes the default behaviour of the builtin charmap
codec to not apply Latin-1 mappings for keys which are not found
in the mapping dictionaries, but instead treat them as undefined
mappings.

The patch was originally written by Martin v. Loewis with some
additional (cosmetic) changes and an updated test script
by Marc-Andre Lemburg.

The standard codecs were recreated from the most current files
available at the Unicode.org site using the Tools/scripts/gencodec.py
tool.

This patch closes the bugs #116285 and #119960.
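
The "undefined mapping" behaviour can be observed through codecs.charmap_decode() in a current Python 3. The one-entry mapping below is hypothetical and exists only for this sketch.

```python
import codecs

# Hypothetical mapping: byte 0x41 decodes to "A"; no other byte is defined.
mapping = {0x41: "A"}

decoded, consumed = codecs.charmap_decode(b"A", "strict", mapping)

# A byte missing from the mapping is an error, not a silent Latin-1 fallback.
try:
    codecs.charmap_decode(b"AB", "strict", mapping)
    undefined_rejected = False
except UnicodeDecodeError:
    undefined_rejected = True

# With errors="replace", the undefined byte becomes U+FFFD.
replaced, _ = codecs.charmap_decode(b"AB", "replace", mapping)
```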
2001-01-03 21:29:14 +00:00
Guido van Rossum
8b26454273 Test more split argument combinations:
1) multi-char separator
2) multi-char separator that only occurs at last position
3) all of the above with mixed Unicode and 8-bit-string arguments
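
The first two combinations can be sketched as follows. The strings are illustrative and not taken from the test file, and the mixed Unicode/8-bit case is omitted because it has no direct equivalent in Python 3.

```python
multi = "a--b--c".split("--")   # multi-char separator
at_end = "ab--".split("--")     # separator occurs only at the last position
absent = "abc".split("--")      # separator does not occur at all
```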
2000-12-19 02:22:31 +00:00
Guido van Rossum
15ffc71c0f Slight improvement to Unicode test suite, inspired by patch #102563:
also test join method of 8-bit strings.

Also changed the test() function to (1) compare the types of the
expected and actual result, and (2) in verbose mode, print the repr()
of the output.
2000-11-29 12:13:59 +00:00
Fred Drake
004d5e6880 Make reindent.py happy (convert everything to 4-space indents!). 2000-10-23 17:22:08 +00:00
Marc-André Lemburg
b96d80201c Updated test with a case which checks for the bug reported in 2000-10-07 08:52:45 +00:00
Marc-André Lemburg
e5034378cc Removing UTF-16 aware Unicode comparison code. This kind of compare
function (together with other locale aware ones) should go into a new collation
support module. See python-dev for a discussion of this removal.

Note: This patch should also be applied to the 1.6 branch.
2000-08-08 08:04:29 +00:00
Marc-André Lemburg
d6d06ade26 Tests for new surrogate support in the UTF-8 codec. By Bill Tutt. 2000-07-07 17:48:52 +00:00
Marc-André Lemburg
b6d78fcd9c Tests for new instance support in unicode(). 2000-07-07 13:46:19 +00:00
Marc-André Lemburg
9d4674168f Added tests for the new .isalpha() and .isalnum() methods. 2000-07-05 09:46:40 +00:00
Marc-André Lemburg
af69f15d21 Marc-Andre Lemburg <mal@lemburg.com>:
Moved tests of new Unicode Char Name support to a separate test.
2000-06-30 09:13:35 +00:00
Marc-André Lemburg
a6f73d64c5 Marc-Andre Lemburg <mal@lemburg.com>:
Added tests for the new Unicode character name support in the
standard unicode-escape codec.
2000-06-28 16:41:23 +00:00
Marc-André Lemburg
bddf502a1f Marc-Andre Lemburg <mal@lemburg.com>:
Removed a test which can fail when the default locale setting
uses a Latin-1 encoding. The test case is not applicable anymore.
2000-06-14 09:17:25 +00:00
Marc-André Lemburg
8462573826 Marc-Andre Lemburg <mal@lemburg.com>:
Fixed some tests to not cause the script to fail, but rather
output a warning (which then is caught by regrtest.py as wrong
output). This is needed to make test_unicode.py run through
on JPython.
Thanks to Finn Bock.
2000-06-13 12:05:36 +00:00
Marc-André Lemburg
59a044b7d2 Marc-Andre Lemburg <mal@lemburg.com>:
Updated to the fix in %c formatting: it now always checks for
a one character argument.
2000-06-08 17:50:55 +00:00
Fred Drake
774c931c12 M.-A. Lemburg <mal@lemburg.com>:
Added another test for string formatting (the one that
produced the core dump now fixed in unicodeobject.c).
2000-05-09 19:57:46 +00:00
Guido van Rossum
6650320349 Get rid of memory leak caused by assigning sys.exc_info() to a local.
Store sys.exc_info()[:2] instead.
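
The leak arises because the third item of sys.exc_info() is a traceback, which refers to the frame, whose locals would in turn refer to the traceback. The sketch below shows the pattern of keeping only the first two items; the function name `capture` is hypothetical. In Python 3 the exception value also carries a __traceback__ attribute, so treat this as an illustration of the pattern rather than a complete fix there.

```python
import sys

def capture():
    """Return (type, value) of a caught exception without keeping the traceback in a local."""
    try:
        raise ValueError("example")
    except ValueError:
        # Slice off the traceback so no local refers back to this frame through it.
        exc_type, exc_value = sys.exc_info()[:2]
    return exc_type, exc_value

kind, value = capture()
```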
2000-04-28 20:39:58 +00:00
Fred Drake
e0243e24be M.-A. Lemburg <mal@lemburg.com>:
Added test for Unicode string concatenation.
2000-04-13 14:11:56 +00:00
Guido van Rossum
7ee801d6af Marc-Andre Lemburg:
Modified .splitlines() tests according to the changes
in unicodeobject.c.
2000-04-11 15:37:02 +00:00
Guido van Rossum
9706486b9f Marc-Andre Lemburg:
* '...%s...' % u"abc" now coerces to Unicode just like
  string methods. Care is taken not to reevaluate already formatted
  arguments -- only the first Unicode object appearing in the
  argument mapping is looked up twice. Added test cases for
  this to test_unicode.py.
2000-04-10 13:52:48 +00:00
Guido van Rossum
9e896b37c7 Marc-Andre's third try at this bulk patch seems to work (except that
his copy of test_contains.py seems to be broken -- the lines he
deleted were already absent).  Checkin messages:


New Unicode support for int(), float(), complex() and long().

- new APIs PyInt_FromUnicode() and PyLong_FromUnicode()
- added support for Unicode to PyFloat_FromString()
- new encoding API PyUnicode_EncodeDecimal() which converts
  Unicode to a decimal char* string (used in the above new
  APIs)
- shortcuts for calls like int(<int object>) and float(<float obj>)
- tests for all of the above
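
In a current Python 3, the visible effect of this kind of Unicode support is that the numeric constructors accept decimal digits from other scripts. The sketch below is written under that assumption and does not call the C APIs listed above.

```python
arabic_indic = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGITS ONE, TWO, THREE

as_int = int(arabic_indic)      # Unicode decimal digits are converted to their values
as_float = float(arabic_indic)
same_int = int(as_int)          # int(<int object>) simply yields the same value
```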

Unicode compares and contains checks:
- comparing Unicode and non-string types now works; TypeErrors
  are masked, all other errors such as ValueError during
  Unicode coercion are passed through (note that PyUnicode_Compare
  does not implement the masking -- PyObject_Compare does this)
- contains now works for non-string types too; TypeErrors are
  masked and 0 returned; all other errors are passed through

Better testing support for the standard codecs.

Misc minor enhancements, such as an alias dbcs for the mbcs codec.

Changes:
- PyLong_FromString() now applies the same error checks as
  does PyInt_FromString(): trailing garbage is reported
  as error and no longer silently ignored. The only characters
  which may be trailing the digits are 'L' and 'l' -- these
  are still silently ignored.
- string.ato?() now directly interface to int(), long() and
  float(). The error strings are now a little different, but
  the type still remains the same. These functions are now
  ready to get declared obsolete ;-)
- PyNumber_Int() now also does a check for embedded NULL chars
  in the input string; PyNumber_Long() already did this (and
  still does)
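
The stricter error checks can be illustrated at the Python level. The sketch assumes Python 3, where int() covers what int() and long() did, and the helper name `rejects` is hypothetical.

```python
def rejects(text):
    """Return True if int() refuses the string with a ValueError."""
    try:
        int(text)
    except ValueError:
        return True
    return False

trailing_garbage = rejects("123abc")  # garbage after the digits is reported
embedded_nul = rejects("12\x003")     # an embedded NUL character is reported too
clean = rejects("123")                # a well-formed literal is accepted
```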

Followed by:

Looks like I've gone a step too far there... (and test_contains.py
seems to have a bug too).

I've changed back to reporting all errors in PyUnicode_Contains()
and added a few more test cases to test_contains.py (plus corrected
the join() NameError).
2000-04-05 20:11:21 +00:00
Guido van Rossum
24bdb0474f Marc-Andre Lemburg:
The attached patch set includes a workaround to get Python with
Unicode to compile on BSDI 4.x (courtesy Thomas Wouters; the cause
is a bug in the BSDI wchar.h header file) and Python interfaces
for the MBCS codec donated by Mark Hammond.

Also included are some minor corrections w/r to the docs of
the new "es" and "es#" parser markers (use PyMem_Free() instead
of free(); thanks to Mark Hammond for finding these).

The unicodedata tests are now in a separate file
(test_unicodedata.py) to avoid problems if the module cannot
be found.
2000-03-28 20:29:59 +00:00
Guido van Rossum
d8855fde88 Marc-Andre Lemburg:
Attached you find the latest update of the Unicode implementation.
The patch is against the current CVS version.

It includes the fix I posted yesterday for the core dump problem
in codecs.c (was introduced by my previous patch set -- sorry),
adds more tests for the codecs and two new parser markers
"es" and "es#".
2000-03-24 22:14:19 +00:00
Barry Warsaw
51ac58039f On 17-Mar-2000, Marc-Andre Lemburg said:
Attached you find an update of the Unicode implementation.

    The patch is against the current CVS version. I would appreciate
    if someone with CVS checkin permissions could check the changes
    in.

    The patch contains all bugs and patches sent this week and also
    fixes a leak in the codecs code and a bug in the free list code
    for Unicode objects (which only shows up when compiling Python
    with Py_DEBUG; thanks to MarkH for spotting this one).
2000-03-20 16:36:48 +00:00
Guido van Rossum
d4d2684240 Marc-Andre Lemburg: Add tests for mixed use of char in string. 2000-03-13 23:21:48 +00:00
Guido van Rossum
a831cac7a8 Marc-Andre Lemburg: test script for Unicode implementation. 2000-03-10 23:23:21 +00:00