# Mixing Python with C

A common piece of rhetoric up the sleeve of any Pythonista is this: "...and you can always write the slow parts in C if you have to!" It's typically said as an aside and rarely elaborated on.

It's not a bad piece of rhetoric; I even use it from time to time. But it's only persuasive to programmers who haven't stepped much beyond their own narrow corner of technology. Anyone who's worked across a variety of languages ought to realize that "can interface with C" isn't a feature that's supposed to be marketable; it's a hard requirement. They may never have had to do it themselves, but they should at least know that in principle it can be done. In practice it's often a lot more difficult than it should be. If Python's C-interfacing capabilities were as slick as Clojure's Java interop, I wouldn't want to write about them.

What do I mean by "can interface with C"? A language that can interface with C is one that can directly call functions in a compiled shared object written in C, and where C can likewise reach into the language's internals to call its code. This is distinct from using one language to ask the operating system to run a compiled program written in the other and collect its output. Java interfaces with C, and Python interfaces with C.
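To make that concrete, here's the Python side in miniature: ctypes loading the system C math library and calling a function in it directly, no subprocess involved. (A sketch assuming a Unix-like system; the library name fallback is a Linux-ism.)

```python
import ctypes
import ctypes.util

# Load the C math library directly -- we're calling into a compiled shared
# object, not running a separate program.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare the C signature: double sqrt(double). Without this, ctypes
# assumes everything is an int.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(2.0))
```

That's the whole trick in three lines: load, declare, call.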

The general caveats of a natively compiled language apply if you decide to mix languages like Python and C: C's ideal is "write once, compile anywhere" rather than Python's "write once, run anywhere", and that ideal is a lot harder to achieve. The easiest way to do C programming is to pick a platform and not support others unless you have to. If you do want to support others, make sure you really need a natively compiled language before starting.

With Python specifically there are other caveats (C code can break guarantees Python takes for granted, like immutability--a buggy extension can, e.g., turn the integer 4 into 5), and there are other options to consider before you dive into C-land and abandon "write once, run anywhere":

* Have you tried running your program with PyPy? It's super-fast! Sometimes even faster than hand-crafted C.
* Have you tried Cython? If you're doing this sort of thing professionally, you probably should use it unless you're a master C programmer and like doing things the long way.
* Is it fast enough (or even faster) to just have the OS spawn the C program as a subprocess rather than talk to the code directly?
* Have you made sure your code is using the most efficient algorithms?
* Can you throw another machine/core at the problem?
* Have you tried SWIG?
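On the subprocess question: shelling out is sometimes plenty. A minimal sketch, using another Python interpreter as a stand-in for a compiled C program (for the real thing you'd pass something like `'./a.out'`):

```python
import subprocess
import sys

# Run a separate process and capture its stdout -- the "ask the OS"
# alternative to linking against the code directly.
out = subprocess.check_output(
    [sys.executable, "-c", "print('hello from the child')"])
print(out.decode().strip())
```

You pay process-startup cost once, but if the child does a big batch of work per invocation, that cost amortizes away.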

Maybe you're using a lot of libraries and you understand that "write once, run anywhere" is a fool's ideal. I like to approach the ideal as closely as I can, but I do admit when it's not worth the effort and free myself up to take advantage of delicious platform assumptions. (We're shipping my dev machine, right?)

Alright, enough chat, let's get to some examples. I decided to do a benchmark test on the SuperFast hash function. Why? It seemed like an okay idea.

Here's a benchmark for the C:

/* By Paul Hsieh (C) 2004, 2005.  Covered under the Paul Hsieh derivative license.

   http://www.azillionmonkeys.com/qed/hash.html */
#include <stdint.h>
#include <stddef.h> /* for NULL -- don't redefine it by hand */
#undef get16bits
#if (defined(__GNUC__) && defined(__i386__)) || defined(__WATCOMC__) \
    || defined(_MSC_VER) || defined (__BORLANDC__) || defined (__TURBOC__)
#define get16bits(d) (*((const uint16_t *) (d)))
#endif

#if !defined (get16bits)
#define get16bits(d) ((((uint32_t)(((const uint8_t *)(d))[1])) << 8)\
                      +(uint32_t)(((const uint8_t *)(d))[0]) )
#endif

uint32_t SuperFastHash (const char * data, int len) {
    uint32_t hash = len, tmp;
    int rem;

    if (len <= 0 || data == NULL) return 0;

    rem = len & 3;
    len >>= 2;

    /* Main loop */
    for (;len > 0; len--) {
        hash  += get16bits (data);
        tmp    = (get16bits (data+2) << 11) ^ hash;
        hash   = (hash << 16) ^ tmp;
        data  += 2*sizeof (uint16_t);
        hash  += hash >> 11;
    }

    /* Handle end cases */
    switch (rem) {
        case 3: hash += get16bits (data);
                hash ^= hash << 16;
                hash ^= ((signed char)data[sizeof (uint16_t)]) << 18;
                hash += hash >> 11;
                break;
        case 2: hash += get16bits (data);
                hash ^= hash << 11;
                hash += hash >> 17;
                break;
        case 1: hash += (signed char)*data;
                hash ^= hash << 10;
                hash += hash >> 1;
    }

    /* Force "avalanching" of final 127 bits */
    hash ^= hash << 3;
    hash += hash >> 5;
    hash ^= hash << 4;
    hash += hash >> 17;
    hash ^= hash << 25;
    hash += hash >> 6;

    return hash;
}

/* Jach's Benchmark */

#include <time.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct timespec start, end;
    long secs;
    long hashes = 0;
    char data[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20};
    clock_gettime(CLOCK_MONOTONIC, &start);
    clock_gettime(CLOCK_MONOTONIC, &end);
    while ((secs = end.tv_sec - start.tv_sec) < 20) {
        uint32_t hash = SuperFastHash(data, 20);
        data[hash % 20] += 1;
        clock_gettime(CLOCK_MONOTONIC, &end);
        ++hashes;
    }

    printf("secs: %ld, hashes: %ld, hashes/sec: %f, Khashes/sec: %f\n", secs, hashes,
           hashes/20.0, hashes/20.0/1000.0);
    return 0;
}


Compiled without optimization: gcc hash_test.c -lrt (Don't forget to link librt, which provides clock_gettime; modern linkers want the library after the source file.)

Results on a single thread on my "Intel(R) Core(TM) i7 CPU 920" underclocked to 2.13GHz: secs: 20, hashes: 160503014, hashes/sec: 8025150.700000, Khashes/sec: 8025.150700

With -O3 optimization: secs: 20, hashes: 598883583, hashes/sec: 29944179.150000, Khashes/sec: 29944.179150

Wow, nearly 4 times faster!

Here's a Python version I adapted from this dude that I assume works.

# -*- coding: utf-8 -*-
# Copyright (c) 2012, Jean-Rémy Bancel <jean-remy.bancel@telecom-paristech.org>
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of the Chromagon Project nor the
# names of its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL Jean-Rémy Bancel BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
import binascii

def get16bits(data):
    """Return the first 16 bits of data as a little-endian integer"""
    return int(binascii.hexlify(str(data[1::-1])), 16)

def superFastHash(data):
    hash = length = len(data)
    if length == 0:
        return 0

    rem = length & 3
    length >>= 2

    while length > 0:
        hash += get16bits(data) & 0xFFFFFFFF
        tmp = (get16bits(data[2:]) << 11) ^ hash
        hash = ((hash << 16) & 0xFFFFFFFF) ^ tmp
        data = data[4:]
        hash += hash >> 11
        hash = hash & 0xFFFFFFFF
        length -= 1

    # Use one-byte slices (data[2:3]) rather than indexing so this works on
    # both str and bytearray inputs -- indexing a bytearray yields an int.
    if rem == 3:
        hash += get16bits(data)
        hash ^= (hash << 16) & 0xFFFFFFFF
        hash ^= (int(binascii.hexlify(str(data[2:3])), 16) << 18) & 0xFFFFFFFF
        hash += hash >> 11
    elif rem == 2:
        hash += get16bits(data)
        hash ^= (hash << 11) & 0xFFFFFFFF
        hash += hash >> 17
    elif rem == 1:
        hash += int(binascii.hexlify(str(data[0:1])), 16)
        hash ^= (hash << 10) & 0xFFFFFFFF
        hash += hash >> 1

    hash = hash & 0xFFFFFFFF
    hash ^= (hash << 3) & 0xFFFFFFFF
    hash += hash >> 5
    hash = hash & 0xFFFFFFFF
    hash ^= (hash << 4) & 0xFFFFFFFF
    hash += hash >> 17
    hash = hash & 0xFFFFFFFF
    hash ^= (hash << 25) & 0xFFFFFFFF
    hash += hash >> 6
    hash = hash & 0xFFFFFFFF

    return hash

# Jach's Benchmark

import time

def main():
    hashes = 0
    data = bytearray([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
    start = time.clock()
    secs = time.clock() - start
    while secs < 20:
        superFastHash(data)
        secs = time.clock() - start
        hashes += 1

    print "secs: %ld, hashes: %ld, hashes/sec: %f, Khashes/sec: %f\n" % (secs, hashes, hashes/20.0, hashes/20.0/1000.0)

if __name__ == '__main__':
    main()


Results with Python 2.7.5: secs: 20, hashes: 636858, hashes/sec: 31842.900000, Khashes/sec: 31.842900

With PyPy 2.0.2: secs: 20, hashes: 6254158, hashes/sec: 312707.900000, Khashes/sec: 312.707900

So PyPy is about 10 times faster than regular CPython, but still 100 times slower than C! How can we make our Python version acceptable? By getting rid of the Python and calling the C version.

There are several ways we can approach this. I like the simplest way, but it is perhaps the most unsafe, and that's compiling a .so file and talking to it directly through Python's ctypes.

gcc -O3 -fPIC -shared hash_test.c -o super.so -lrt
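Before the benchmark, it's worth showing the loading step on its own. A sketch, assuming super.so (built with the gcc command above) sits in the current directory; declaring the signature up front keeps ctypes from guessing int for everything:

```python
import ctypes
import os

def load_superfasthash(path="./super.so"):
    # Load the shared object and declare the C signature:
    #   uint32_t SuperFastHash(const char *data, int len)
    lib = ctypes.CDLL(path)
    lib.SuperFastHash.restype = ctypes.c_uint32
    lib.SuperFastHash.argtypes = [ctypes.c_char_p, ctypes.c_int]
    return lib

if os.path.exists("./super.so"):
    lib = load_superfasthash()
    print(lib.SuperFastHash(b"hello world", 11))
```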

And the Python:

from ctypes import *

lib = CDLL('./super.so')  # load the shared object built above

def main2():
    hashes = 0
    # A 20-byte C char array. (Note: c_char_p * 20 would be an array of
    # twenty char *pointers*, not twenty chars.)
    data = (c_char * 20)(*[chr(i) for i in range(1, 21)])
    start = time.clock()
    secs = time.clock() - start
    while secs < 20:
        lib.SuperFastHash(data, 20)
        secs = time.clock() - start
        hashes += 1

    print "secs: %ld, hashes: %ld, hashes/sec: %f, Khashes/sec: %f\n" % (secs, hashes, hashes/20.0, hashes/20.0/1000.0)

if __name__ == '__main__':
    #main()
    main2()


Results (just with Python2.7 again): secs: 20, hashes: 11191554, hashes/sec: 559577.700000, Khashes/sec: 559.577700

Still not fast enough. What's the slowdown? Individual hashes are computed at blazing speed, but in the benchmark we're actually limited by Python's looping speed! Indeed, taking out the lib.SuperFastHash C call entirely gives results: secs: 20, hashes: 28396765, hashes/sec: 1419838.250000, Khashes/sec: 1419.838250

Doing 'nothing' is only twice as fast as doing something. Isn't that just sad? Explicit loops are your enemy in Python.
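You can see the loop machinery's cost in isolation with a quick timeit sketch (nothing here depends on the hash code):

```python
import timeit

# Time 1000 runs of a 1000-iteration loop: once with an empty body, once
# with a cheap call in the body. The two come out in the same ballpark --
# the loop itself, not the work inside it, dominates.
empty = timeit.timeit("for _ in range(1000): pass", number=1000)
calls = timeit.timeit("for _ in range(1000): f()", setup="f = int", number=1000)
print(empty, calls)
```

The moral: if you must cross the Python/C boundary, do a batch of work per crossing rather than one tiny call per loop iteration.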

Let's just call C's benchmark, eh? Change the call from main2() to lib.main(), and... secs: 20, hashes: 338130043, hashes/sec: 16906502.150000, Khashes/sec: 16906.502150

Yay. But it's still only about half as fast as the standalone C program. Why? And does this happen if we actually run the C program as a subprocess? Replace the call to lib.main() with import subprocess; print subprocess.check_output('./a.out') and we get: secs: 20, hashes: 643060473, hashes/sec: 32153023.650000, Khashes/sec: 32153.023650, i.e. the expected full-speed results.

So what the hell is going on with the direct call? Before I get into that, let's dig even deeper into C interaction with Python. It's nice that we can talk to shared object files this way, even if it's a little bug-prone and clunky (and mysteriously slow), but it'd be even nicer if we could talk to this C code in a natural, Pythonic way, as if it were any other library. To do that, we need to extend the C code and make use of Python.h. What I've done below is add a benchmark function equivalent to main(), except it returns the summary string as a Python object instead of printing it.

#include <python2.7/Python.h>

static PyObject* hashtest_benchmark(PyObject* self, PyObject* args) {
    struct timespec start, end;
    long secs;
    long hashes = 0;
    char data[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20};
    clock_gettime(CLOCK_MONOTONIC, &start);
    clock_gettime(CLOCK_MONOTONIC, &end);
    while ((secs = end.tv_sec - start.tv_sec) < 20) {
        uint32_t hash = SuperFastHash(data, 20);
        data[hash % 20] += 1;
        clock_gettime(CLOCK_MONOTONIC, &end);
        ++hashes;
    }

    char buf[100];
    sprintf(buf, "secs: %ld, hashes: %ld, hashes/sec: %f, Khashes/sec: %f\n", secs, hashes,
            hashes/20.0, hashes/20.0/1000.0);
    return PyString_FromString(buf);
}

static PyMethodDef hashtestMethods[] = {
    {"benchmark", hashtest_benchmark, METH_VARARGS, "Run Benchmark."},
    {NULL, NULL, 0, NULL}
};

void inithashtest(void) {
    (void) Py_InitModule("hashtest", hashtestMethods);
}


Next we make a setup.py file to nicely handle building this for us (of course we could run the gcc commands ourselves, but this way you don't need to remember all the flags):

from distutils.core import setup, Extension

module = Extension('hashtest', sources=['hash_test.c'], libraries=['rt'], extra_compile_args=['-O3'])
setup(name='Hash Test', version='1.0', ext_modules=[module])


And run python2.7 setup.py build, resulting in this output:

running build
running build_ext
building 'hashtest' extension
creating build
creating build/temp.linux-x86_64-2.7
x86_64-pc-linux-gnu-gcc -pthread -fPIC -I/usr/include/python2.7 -c hash_test.c -o build/temp.linux-x86_64-2.7/hash_test.o -O3
creating build/lib.linux-x86_64-2.7
x86_64-pc-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,--as-needed -L. build/temp.linux-x86_64-2.7/hash_test.o -L/usr/lib64 -lrt -lpython2.7 -o build/lib.linux-x86_64-2.7/hashtest.so


After typing cd build/lib.linux-x86_64-2.7/, there's a hashtest.so file in there. Making a simple Python file to test it now:

import hashtest
print hashtest.benchmark()


(See how natural that is?) And the results, drum roll.... secs: 20, hashes: 320458498, hashes/sec: 16022924.900000, Khashes/sec: 16022.924900

Dafuq, right? Still half as fast as it should be. Why?

I admit I have no idea. And this lack of understanding bugged me so much that this post has sat idle since June of 2012. Today I asked why on Stack Overflow, hopefully I'll get an answer and can update this post. It might just be something really stupid I'm doing.

Update: The answer has been revealed. By default gcc inlined the SuperFastHash function. But when creating a shared library with position-independent code, gcc won't automatically inline the function, because a symbol exported from a shared object can be interposed at load time (e.g. via LD_PRELOAD), so the compiler can't assume the definition it sees is the one that will actually be called. If you add the keyword inline to SuperFastHash's function definition (provided you don't care about strict C89 conformance), that's enough to make gcc inline it in the shared-object PIC case, and the Python call will be just as fast as the regular executable. (There is a slight performance penalty to PIC code, but it's not enough to explain 50%!)

Update 2: I realized this about a day after making the last update but forgot to mention it here... adding the 'inline' keyword by itself is really the wrong fix. The right thing to do is mark the function 'static' (and optionally 'inline' after that). In other words, not inlining is what was supposed to happen: because you built a shared object, any function not declared static is part of the object's public interface and referable from outside. Marking the function static alerts users that they can't call it from outside that file, and alerts the compiler that it doesn't need to export a symbol for it and can in fact inline it.

In my next post about this topic, I plan to (hopefully) show that algorithms can trump languages. I'll reiterate an earlier question: are you using the most efficient algorithm? If your algorithm is O(N^2), it's true that you can get a speedup by rewriting it in C, but if there exists an O(N) algorithm, chances are a straight Python implementation of that O(N) algorithm will beat the C version of the O(N^2) algorithm.
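As a tiny illustration of that claim (a hypothetical duplicate-check stands in for the collision detection; neither function is from the upcoming post):

```python
def has_duplicate_quadratic(items):
    # O(N^2): compare every pair. Rewriting this in C makes it a faster
    # O(N^2) -- it's still quadratic.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_linear(items):
    # O(N): one pass with a set. Plain Python, but it wins for large N.
    seen = set()
    for x in items:
        if x in seen:
            return True
        seen.add(x)
    return False

data = list(range(10000)) + [5]
print(has_duplicate_quadratic(data), has_duplicate_linear(data))
```

For 10,000 items the quadratic version does up to ~50 million comparisons where the linear one does 10,001 set lookups; no constant-factor speedup from C closes that gap as N grows.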

So I want to demonstrate the algorithm question by writing a "game" where a lot of boxes fly about the screen. If boxes collide with others or the screen edge, they reverse their direction. We'll consider two collision algorithms: a simple O(N^2) one that checks every box against every other box, written in C, and a faster algorithm that uses a sweep-and-prune algorithm written in Python. Will the algorithm difference matter for this particular application? We'll find out!

#### Posted on 2014-03-27 by Jach

Tags: c, programming, python, tips
