해쉬 함수는 O(1)이 맞을까 / Python의 빌트인 해시 함수 찾기

Artiper

|2023. 6. 19. 22:53

해쉬 테이블, 해쉬 자체에 대한 기본적인 설명은 다른 분들의 여러 좋은 포스팅에서도 충분히 다루고 있으니 생략한다. 그리고 언어를 파이썬으로 찾아보게 된건, c++의 stl보다 파이썬의 hash의 코드가 더 읽기 쉬웠기도 하고, 요즘 내가 많이 쓰고 있는 언어가 python이라 이를 기준으로 찾아보게 되었다.

파이썬에서 내부구현된 해시 함수를 찾아보자.

우선 파이썬 공식문서의 Object protocol에서 PyObject_Hash(PyObject *o) 함수에 대한 설명을 찾을 수 있다.

Py_hash_t PyObject_Hash(PyObject *o)

Stable ABI의 일부이며, 객체 o의 해시값을 계산하여 반환한다. 실패하면 -1을 반환한다. 이는 python의 표현식 hash(o)와 동일하다.

Object.c파일에 있는 PyObject의 구현부는 다음과 같다. (python 3.10.12기준)

Py_hash_t
PyObject_Hash(PyObject *v)
{
    PyTypeObject *tp = Py_TYPE(v);
    if (tp->tp_hash != NULL)
        return (*tp->tp_hash)(v);
    /* To keep to the general practice that inheriting
     * solely from object in C code should work without
     * an explicit call to PyType_Ready, we implicitly call
     * PyType_Ready here and then check the tp_hash slot again
     */
    if (tp->tp_dict == NULL) {
        if (PyType_Ready(tp) < 0)
            return -1;
        if (tp->tp_hash != NULL)
            return (*tp->tp_hash)(v);
    }
    /* Otherwise, the object can't be hashed */
    return PyObject_HashNotImplemented(v);
}

주석의 내용을 설명하자면, C 코드에서 객체에서만 상속하는 일반적인 관행을 유지하려면 C 코드가 PyType_Ready의 명시적인 호출없이도 작동해야 하는데, 이를 위해 PyType_Ready를 암시적으로 호출한 다음 tp_hash 슬롯을 다시 확인한다.

(영어 잘 못해서 번역이 틀렸을 수도 있습니다. 혹시 틀리게 해석했다면 바로 잡아주시면 감사드리겠습니다.)

위 코드를 보면 tp_hash가 뭔지에 대해서도 궁금하게 되어, 이를 찾아보면 tp_hash는 include / cpython / object.h파일에 선언되어있다.

    /* More standard operations (here for binary compatibility) */

    hashfunc tp_hash;
    ternaryfunc tp_call;
    reprfunc tp_str;
    getattrofunc tp_getattro;
    setattrofunc tp_setattro;

hashfunc이라는 타입으로 tp_hash를 선언하고 있는데, hashfunc를 찾아보자.

include / Object.h에 hashfunc의 함수포인터에 대해 선언되어 있다.

typedef Py_hash_t (*hashfunc)(PyObject *);

typedef으로 함수 포인터를 선언하는 코드인데, Py_hash_t를 return하며, PyObject의 포인터를 매개변수로 받는 hashfunc 함수 포인터를 선언하는 것이다. 이를 통해 tp_hash는 여러 hash함수들을 가질 수 있게 된다.

그럼 이런 의문이 든다. hash함수가 하나가 아닌가? 왜 tp_hash에 함수포인터를 담아두지?

Py_hash_t
PyObject_Hash(PyObject *v)
{
    PyTypeObject *tp = Py_TYPE(v);
    if (tp->tp_hash != NULL)
        return (*tp->tp_hash)(v);

    if (tp->tp_dict == NULL) {
        if (PyType_Ready(tp) < 0)
            return -1;
        if (tp->tp_hash != NULL)
            return (*tp->tp_hash)(v);
    }
    /* Otherwise, the object can't be hashed */
    return PyObject_HashNotImplemented(v);
}

이 코드를 다시보자. 자세히는 모르겠지만, 잘 짠 코드들은 변수나 함수이름만 보고도 유추할 수 있다. 우선 Py_TYPE()함수에 PyObject를 매개변수로 넣어 PyTypeObject* 타입의 변수를 반환받는다. 그리고 PyObject_Hash()함수는 PyTypeObject변수에서 tp_hash변수에 들어가 있는 hash 함수를 호출한 결과를 return으로 주고 있다.

즉, type별로 Hash함수가 짜져있나라고 추측해볼 수 있다. 그럼 PyTypeObject에서 tp_hash를 찾으면 무언가 알 수 있게 되지 않을까?

tp_hash

우선 tp_hash를 다루는 파이썬 공식문서의 설명을 읽어보자. 해당 공식문서의 설명은 다음과 같다.

built-in function hash()를 구현하는 함수에 대한 optional pointer이다.

signature는 PyObject_Hash()와 동일하다.

Value -1은 정상적인 반환 값으로 반환되어서는 안 되며, 해시 값을 계산하는 동안 오류가 발생하면 함수는 예외를 설정하고 -1을 반환해야 한다.

이 필드가 설정되지 않은 경우(그리고 tp_richcompare가 설정되지 않은 경우) 객체의 해시를 가져 오려고 시도하면 TypeError가 발생한다. 이는 PyObject_HashNotImplemented()로 설정하는 것과 동일하다.

이 필드를 명시적으로 PyObject_HashNotImplemented()로 설정하여 부모 타입에서 해시 메서드의 상속을 차단할 수 있다. 이는 파이썬 수준에서 __hash__ = None과 동등한 것으로 해석되어 isinstance(o, collections.Hashtable)가 올바르게 False를 반환하도록 한다. 그 반대의 경우도 마찬가지이다. Python 수준에서 클래스에서 __hash__ = None을 설정하면 tp_hash 슬롯이 PyObject_HashNotImplemented()로 설정된다.

상속:

Group: tp_hash, tp_richcompare

이 필드는 subtype이 tp_richcompare와 함께 상속받는다. subtype의 tp_richcompare와 tp_hash가 모두 NULL일 때는 subtype이 tp_richcompare와 tp_hash 모두 상속받는다.

tp_hash와 PyTypeObject를 찾자.

위는 단순히 공식문서의 설명이었으며 PyTypeObject에 구현된 tp_hash를 찾기 위해 PyTypeObject를 먼저 찾아보자.

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
} PyObject

찾던 중 재밌는 코드가 나왔는데, Python의 Object는 reference count와 PyTypeObject로 구성되는 것을 볼 수 있다.

어쨌든 파이썬 3.10.12에서 Include / object.h 파일에 PyTypeObject가 선언된 것을 볼 수 있다.

/* PyTypeObject structure is defined in cpython/object.h.
   In Py_LIMITED_API, PyTypeObject is an opaque structure. */
typedef struct _typeobject PyTypeObject;

cpython의 object.h로 가면 PyTypeObject의 structure를 볼 수 있다고 하니 이동한다.

// If this structure is modified, Doc/includes/typestruct.h should be updated
// as well.
struct _typeobject {
    PyObject_VAR_HEAD
    const char *tp_name; /* For printing, in format "<module>.<name>" */
    Py_ssize_t tp_basicsize, tp_itemsize; /* For allocation */

    /* Methods to implement standard operations */

    destructor tp_dealloc;
    Py_ssize_t tp_vectorcall_offset;
    getattrfunc tp_getattr;
    setattrfunc tp_setattr;
    PyAsyncMethods *tp_as_async; /* formerly known as tp_compare (Python 2)
                                    or tp_reserved (Python 3) */
    reprfunc tp_repr;

    /* Method suites for standard classes */

    PyNumberMethods *tp_as_number;
    PySequenceMethods *tp_as_sequence;
    PyMappingMethods *tp_as_mapping;

    /* More standard operations (here for binary compatibility) */

    hashfunc tp_hash;
    ternaryfunc tp_call;
    reprfunc tp_str;
    getattrofunc tp_getattro;
    setattrofunc tp_setattro;

    /* Functions to access object as input/output buffer */
    PyBufferProcs *tp_as_buffer;

    /* Flags to define presence of optional/expanded features */
    unsigned long tp_flags;

    const char *tp_doc; /* Documentation string */

    /* Assigned meaning in release 2.0 */
    /* call function for all accessible objects */
    traverseproc tp_traverse;

    /* delete references to contained objects */
    inquiry tp_clear;

    /* Assigned meaning in release 2.1 */
    /* rich comparisons */
    richcmpfunc tp_richcompare;

    /* weak reference enabler */
    Py_ssize_t tp_weaklistoffset;

    /* Iterators */
    getiterfunc tp_iter;
    iternextfunc tp_iternext;

    /* Attribute descriptor and subclassing stuff */
    struct PyMethodDef *tp_methods;
    struct PyMemberDef *tp_members;
    struct PyGetSetDef *tp_getset;
    // Strong reference on a heap type, borrowed reference on a static type
    struct _typeobject *tp_base;
    PyObject *tp_dict;
    descrgetfunc tp_descr_get;
    descrsetfunc tp_descr_set;
    Py_ssize_t tp_dictoffset;
    initproc tp_init;
    allocfunc tp_alloc;
    newfunc tp_new;
    freefunc tp_free; /* Low-level free-memory routine */
    inquiry tp_is_gc; /* For PyObject_IS_GC */
    PyObject *tp_bases;
    PyObject *tp_mro; /* method resolution order */
    PyObject *tp_cache;
    PyObject *tp_subclasses;
    PyObject *tp_weaklist;
    destructor tp_del;

    /* Type attribute cache version tag. Added in version 2.6 */
    unsigned int tp_version_tag;

    destructor tp_finalize;
    vectorcallfunc tp_vectorcall;
};

뭔가 잘 찾아온 것 같다.

hashfunc tp_hash;

그리고 찾던 tp_hash를 찾았는데.. 생각해보니 도르마무였다. 위에서 분명 cpython의 object.h에서 tp_hash를 봤었는데 찾은게 이거였다니..

다시 목표를 상기하자면 tp_hash에 파이썬의 오브젝트 타입마다 각기 구현된 hash함수가 들어가는 것이 나의 추측이었고, 해당 hash함수의 구현부를 보는 것이 나의 목표였다. 그럼 tp_hash에 값을 넣어주는 부분을 한 번 찾아보자.

이 코드를 볼 때, 위에서 다룬 tp_hash 공식문서 설명을 읽은게 꽤나 도움이 되었다.

tp_hash에 값을 대입하거나, tp_hash를 다루는 부분을 찾기위해 tp_hash를 키워드로 두고 검색해서 하나씩 읽고 있었는데, collectionsmodule.c 이라는 파일이 눈에 띄여서 멈춰보았다. 여기라면 PyTypeObject를 사용하여 type선언을 한 것을 확인할 수 있지 않을까라는 기대 때문이었다.

운좋게 deque_type이라는 것을 PyTypeObject 구조체로 선언하는 것을 확인할 수 있었고, tp_hash에는 PyObject_HashNotImplemented 함수 포인터를 넣어줌으로 부모 타입에서 해시 메소드의 상속을 차단했다.

위 코드까지 접하고 나니 타입을 선언할 때 hash함수를 넣어주는 형태면, 타입을 선언하는 곳을 중점적으로 보면 hash함수의 구현부분에 다다를 수 있지 않을까라고 생각했다. 그리고 ` PyObject_HashNotImplemented()` 함수로 Hash를 상속받는 것을 막는다는 뜻은, 부모 클래스 격의 코드에서 Hash함수가 구현되어 있지 않을까도 생각했다.

즉, 다음과 같이 목표를 재정리했다.

타입을 선언하는 곳을 중점적으로 보자.
deque_type의 부모 클래스 격 코드가 존재하면 거기에 hash함수가 구현되어 있는지 살펴보자.

진짜 hash함수 어디있냐?

타입을 선언하는 곳을 중점적으로 보자.(tp_hash를 키워드로 넣어서 찾자.)
deque_type의 타입의 부모 클래스격의 코드가 있으면 거기에 hash함수가 구현되어있는지 살펴보자 정도로 목표를 세분화했다.

우선 2번은 아닌 것 같아서 1번을 시도한다. 2번이 아닌 이유는 deque_type에서 tp_base에 값이 0으로 설정되어있는데, 이는 무언가를 상속받는다고 생각하긴 어려운 코드였다. 하물며 0으로 설정되어 자동으로 PyObject를 상속받는다고 해도,

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
} PyObject;

PyObject는 이렇게 생겨서, 실질적인 구현언 PyTypeObject에 다 있을 것이기에 PyObject는 그저 모든 형을 담기 위한 뼈대 정도로 보여 큰 의미가 없어보였다.

드디어 어떤 타입이든 해시 함수가 구현된 구현부 정도는 찾을 수 있게 되었다.

그리고 드디어 primitive type에 대한 hash function을 발견했다. (여기선 float)

즉, 파이썬의 객체의 타입마다 hash가 필요한 경우 각 타입에 맞게 구현된 해쉬를 사용하는 것을 알 수 있다.

너무 돌아왔지만, 이 글의 주제는 해시 함수가 O(1)이 맞는 것인가에 대한 고찰이다. 즉, 아직 해시 함수에 대한 분석을 하지 않았고, 이에 대한 간단한 코드 분석을 해보자. int, string, float등 각기 구현된 해시 함수를 살펴볼 예정이다.

해시 함수가 O(1)이 맞을까?

int의 hash function

우선 PyLong_Type에 관한 공식문서를 읽어보면,PyTypeObject의 인스턴스 PyLong_Type은 파이썬의 int와 같은 객체이다. 즉 파이썬의 class int와 같다.

int객체에서 사용하는 hash함수의 구현 내용은 다음과 같다.

static Py_hash_t
long_hash(PyLongObject *v)
{
    Py_uhash_t x;
    Py_ssize_t i;
    int sign;

    i = Py_SIZE(v);
    switch(i) {
    case -1: return v->ob_digit[0]==1 ? -2 : -(sdigit)v->ob_digit[0];
    case 0: return 0;
    case 1: return v->ob_digit[0];
    }
    sign = 1;
    x = 0;
    if (i < 0) {
        sign = -1;
        i = -(i);
    }
    while (--i >= 0) {
        /* Here x is a quantity in the range [0, _PyHASH_MODULUS); we
           want to compute x * 2**PyLong_SHIFT + v->ob_digit[i] modulo
           _PyHASH_MODULUS.

           The computation of x * 2**PyLong_SHIFT % _PyHASH_MODULUS
           amounts to a rotation of the bits of x.  To see this, write

             x * 2**PyLong_SHIFT = y * 2**_PyHASH_BITS + z

           where y = x >> (_PyHASH_BITS - PyLong_SHIFT) gives the top
           PyLong_SHIFT bits of x (those that are shifted out of the
           original _PyHASH_BITS bits, and z = (x << PyLong_SHIFT) &
           _PyHASH_MODULUS gives the bottom _PyHASH_BITS - PyLong_SHIFT
           bits of x, shifted up.  Then since 2**_PyHASH_BITS is
           congruent to 1 modulo _PyHASH_MODULUS, y*2**_PyHASH_BITS is
           congruent to y modulo _PyHASH_MODULUS.  So

             x * 2**PyLong_SHIFT = y + z (mod _PyHASH_MODULUS).

           The right-hand side is just the result of rotating the
           _PyHASH_BITS bits of x left by PyLong_SHIFT places; since
           not all _PyHASH_BITS bits of x are 1s, the same is true
           after rotation, so 0 <= y+z < _PyHASH_MODULUS and y + z is
           the reduction of x*2**PyLong_SHIFT modulo
           _PyHASH_MODULUS. */
        x = ((x << PyLong_SHIFT) & _PyHASH_MODULUS) |
            (x >> (_PyHASH_BITS - PyLong_SHIFT));
        x += v->ob_digit[i];
        if (x >= _PyHASH_MODULUS)
            x -= _PyHASH_MODULUS;
    }
    x = x * sign;
    if (x == (Py_uhash_t)-1)
        x = (Py_uhash_t)-2;
    return (Py_hash_t)x;
}

주석이 가독성을 방해할 수도 있으니, 주석을 제거한 버전의 코드도 올린다.

static Py_hash_t
long_hash(PyLongObject *v)
{
    Py_uhash_t x;
    Py_ssize_t i;
    int sign;

    i = Py_SIZE(v);
    switch(i) {
    case -1: return v->ob_digit[0]==1 ? -2 : -(sdigit)v->ob_digit[0];
    case 0: return 0;
    case 1: return v->ob_digit[0];
    }
    sign = 1;
    x = 0;
    if (i < 0) {
        sign = -1;
        i = -(i);
    }
    while (--i >= 0) {
        x = ((x << PyLong_SHIFT) & _PyHASH_MODULUS) |
            (x >> (_PyHASH_BITS - PyLong_SHIFT));
        x += v->ob_digit[i];
        if (x >= _PyHASH_MODULUS)
            x -= _PyHASH_MODULUS;
    }
    x = x * sign;
    if (x == (Py_uhash_t)-1)
        x = (Py_uhash_t)-2;
    return (Py_hash_t)x;
}

Str의 hash_function

/* Believe it or not, this produces the same value for ASCII strings
   as bytes_hash(). */
static Py_hash_t
unicode_hash(PyObject *self)
{
    Py_uhash_t x;  /* Unsigned for defined overflow behavior. */

#ifdef Py_DEBUG
    assert(_Py_HashSecret_Initialized);
#endif
    if (_PyUnicode_HASH(self) != -1)
        return _PyUnicode_HASH(self);
    if (PyUnicode_READY(self) == -1)
        return -1;

    x = _Py_HashBytes(PyUnicode_DATA(self),
                      PyUnicode_GET_LENGTH(self) * PyUnicode_KIND(self));
    _PyUnicode_HASH(self) = x;
    return x;
}

'Computer Science > Data Structure' 카테고리의 다른 글

[자료 구조] 배열로 구현한 스택 (0)	2020.09.14