Make your numerical Python code fly at transonic speed!

Overview of the Python HPC landscape and zoom on Transonic 🚀

Python, only a great glue language?
Pierre Augier, Ashwin Vishnu
           
PySciDataGre (19 March 2019)

Python for High-Performance Computing?

  1. Fast prototyping (Numpy!)

  2. Popular:

    • Well-known

    • Several great libraries

  3. Share ideas between developers / scientists

    • Popularity counts

    • Readability counts

    • Expressivity counts

  4. Anyway, one needs a good and well-known scripting language so yes!

    (even considering Julia)

Where / when should we stop?

Python & fast prototyping...

The software engineering method for scientists 👩‍🔬 👨‍🔬 and HPC

  1. Fast prototyping

  2. Solidify as needed

Again and again: (1, 2), (1, 2), ...

Python: a programming language, compromises ⚖️

Designed for fast prototyping & "glue" codes together

  • Generalist + easy to learn ⇒ huge and diverse community 👨🏿‍🎓🕵🏼 👩🏼‍🎓 👩🏽‍🏫👨🏽‍💻👩🏾‍🔬 🎅🏼 🌎 🌍 🌏

  • Expressivity and readability

  • Not oriented towards high performance

    (fast and easy dev, easy debug, correctness)

    • Highly dynamic 🐒 + introspection (inspect.stack())

    • Automatic memory management 💾

    • All objects encapsulated 🥥 (PyObject, C struct)

    • Objects accessible through "references" ➡️

    • Usually interpreted

Python interpreters

CPython

Interpreted (nearly) instruction by instruction, with (nearly) no code optimization

The numerical stack (Numpy, Scipy, Scikits, ...) based on the CPython C API (CPython implementation details)!


PyPy

Optimized implementation with tracing Just-In-Time compilation

"Abstractions for free"

The CPython C API is an issue! PyPy can't accelerate Numpy code!


Micropython

For microcontrollers

Python & performance

References and PyObjects

In [2]:
mylist = [1, 3, 5]

list: an array of references to PyObjects

The C / Python border

In [3]:
import numpy as np

arr = 2 * np.arange(10)
print(arr[2])
print(arr[2])
4

Python & performance

Python interpreters are bad at crunching numbers

Pure Python terrible 🐢 (except with PyPy)...

In [4]:
from math import sqrt
my_const = 10.
result = [elem * sqrt(my_const * 2 * elem**2) for elem in range(1000)]

but even this is not very efficient (temporary objects)...

In [5]:
import numpy as np
a = np.arange(1000)
result = a * np.sqrt(my_const * 2 * a**2)

Even slightly worse with PyPy 🙁

Is Python efficient enough?

Python is known to be slow... But what does it mean?


Efficiency / inefficiency: depends on tasks ⏱


When is it inefficient? Especially for number crunching 🔢 ...


Can we write efficient scientific code in 🐍 ?

Book

Performance (generalities)

Measure ⏱, don't guess! Profile to find the bottlenecks.

cProfile (pstats, SnakeViz), line_profiler, perf, perf_events
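A minimal sketch of this workflow with the standard-library cProfile and pstats (the profiled function is a made-up example):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # deliberately loop-heavy function, to have something to profile
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
total = slow_sum(100_000)
profiler.disable()

# report the 5 most expensive calls, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```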


Do not optimize everything!

  • "Premature optimization is the root of all evil" (Donald Knuth)

  • 80 / 20 rule: efficiency is important for the expensive parts and NOT for the small ones


CPU-bound or IO-bound problems


Use the right algorithms and the right data structures!

For example, using Numpy arrays instead of Python lists...


Write unit tests before optimizing, to maintain correctness!

unittest, pytest
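A minimal pytest-style sketch: pin down the kernel's behaviour before touching it (the rescale function is an illustrative example):

```python
def rescale(values, factor):
    # the "numerical kernel" we may later optimize
    return [v * factor for v in values]

def test_rescale():
    # discovered and run by pytest; can also be called directly
    assert rescale([1, 2, 3], 2.0) == [2.0, 4.0, 6.0]
    assert rescale([], 2.0) == []

test_rescale()
```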

"Crunching numbers" and computers architectures

 

CPU optimizations

  • pipelining, hyper-threading, vectorization, advanced instructions (SIMD), ...

  • important to get data aligned in memory (arrays)

Proper compilation needed for high efficiency!


Compilation to virtual machine instructions

What CPython does (compiles to "byte code", with nearly no optimization; see the dis module)
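For example, the standard-library dis module shows the bytecode that the CPython compiler produces (opcode names vary slightly between CPython versions):

```python
import dis

def add_one(x):
    return x + 1

# human-readable disassembly of the compiled bytecode
dis.dis(add_one)

# the opcode names, programmatically
opnames = [instr.opname for instr in dis.get_instructions(add_one)]
print(opnames)
```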


Compilation to machine instructions

  • Just-in-time

    Has to be fast (warm up), can be hardware specific

  • Ahead-of-time

    Can be slow, hardware specific or more general to distribute binaries

Compilers are usually good for optimizations! Better than most humans...


Transpilation

From one language to another language (for example Python to C++)

Parallelism

Hardware:

  • Multicore CPU

  • Multi nodes super computers (MPI)

  • GPU (Nvidia: Cuda, Cupy) / Intel Xeon Phi


Different problems

  • CPU-bound (need to use several cores at the same time)

  • IO-bound (waiting for IO)

Different parallel strategies

IO-bound: one process + async/await


Cooperative concurrency


Functions able to pause


asyncio, trio
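A minimal asyncio sketch of cooperative concurrency: while one coroutine is paused on a (simulated) IO wait, the event loop runs the others, so the three waits overlap:

```python
import asyncio
import time

async def fake_io(name, delay):
    # await pauses this coroutine; the event loop runs other tasks meanwhile
    await asyncio.sleep(delay)
    return name

async def main():
    t0 = time.perf_counter()
    results = await asyncio.gather(
        fake_io("a", 0.1), fake_io("b", 0.1), fake_io("c", 0.1)
    )
    # the three 0.1 s waits overlap instead of adding up
    elapsed = time.perf_counter() - t0
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results, elapsed)
```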

Different parallel strategies

One process split into lightweight execution units called threads 👩🏼‍🔧 👨🏼‍🔧👩🏼‍🔧 👨🏼‍🔧

  • handled by the OS

  • share memory and can use different CPU cores at the same time

How?

  • OpenMP (C / C++ / Fortran)

  • In Python: threading and concurrent.futures

⚠️ In Python, (roughly) one interpreter per process, and the Global Interpreter Lock (GIL)...

  • In a Python program, different threads can run at the same time (and take advantage of multiple cores)

  • But... the Python interpreter runs Python bytecode sequentially!

    • Terrible 🐌 for CPU-bound code if the Python interpreter is used a lot!

    • No problem for IO-bound code!
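A small sketch of the IO-bound case with standard-library threads: the sleeps (standing in for IO waits) release the GIL, so four 0.1 s "requests" complete in roughly 0.1 s instead of 0.4 s:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i):
    # time.sleep releases the GIL, like real IO waits do
    time.sleep(0.1)
    return i

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_request, range(4)))
elapsed = time.perf_counter() - t0
print(results, elapsed)
```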

Different parallel strategies

One program, $n$ processes 👩🏼‍🔧 👨🏼‍🔧👩🏼‍🔧 👨🏼‍🔧

Exchange data (for example with MPI):

Very efficient and no problem with Python!

  • mpi4py
  • h5py parallel

2 other packages for parallel computing with Python

  • dask
  • joblib

Python for HPC: first a glue language

Many tools to interact with static languages:

     ctypes, cffi, cython, cppyy, pybind11, f2py, pyo3, ...

Glue together pieces of native code (C, Fortran, C++, Rust, ...) with a nice syntax

     ⇒ Numpy, Scipy, ...

Remarks:

  • Numpy: great syntax for expressing algorithms, (nearly) as much information as in Fortran

  • Performance of a @ b (Numpy) versus a * b (Julia)?

         Same! The same library is called! (often OpenBLAS or MKL)

General principle for perf with Python (not fully valid for PyPy):

Don't use the Python interpreter (and small Python objects) too often for computationally demanding tasks.

Pure Python

     → Numpy

         → Numpy without too many loops (vectorized)

            → C extensions

But ⚠️ ⚠️ ⚠️ writing a C extension by hand is not a good idea! ⚠️ ⚠️ ⚠️

No need to quit the Python language to avoid overusing the Python interpreter!

Tools to compile Python / write C extensions

  • Cython: a language that is a superset of Python

  • A great mix of Python / C / CPython C API!

    Very powerful, but a tool for experts!

  • Easy to study where the interpreter is used (cython --annotate).

  • Very mature

  • Now able to use Pythran internally...

My experience: large Cython extensions difficult to maintain

Numba: (per-method) JIT for Python-Numpy code

  • Very simple to use (just add a few decorators) 🙂
In [6]:
from numba import jit

@jit
def myfunc(x):
    return x**2
  • "nopython" mode (fast and no GIL) 🙂

  • Also a "python" mode 🙂

  • GPU and Cupy 😀

  • Methods (of classes) 🙂

Python decorators

In [7]:
def mydecorator(func):
    # do something with the function
    print(func)
    # return a(nother) function
    return func
In [8]:
@mydecorator
def myfunc(x):
    return x**2
<function myfunc at 0x7fc5bd76f378>

This mysterious syntax with @ is just syntactic sugar for:

In [9]:
def myfunc(x):
    return x**2

myfunc = mydecorator(myfunc)
<function myfunc at 0x7fc5bd76f598>
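A slightly more realistic decorator (illustrative): functools.wraps preserves the wrapped function's name and docstring, which a plain wrapper would lose:

```python
import functools
import time

def timed(func):
    """Decorator that records how long each call takes."""
    @functools.wraps(func)  # keep func's __name__ and __doc__
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_duration = time.perf_counter() - t0
        return result
    return wrapper

@timed
def myfunc(x):
    return x**2

print(myfunc(4), myfunc.__name__)
```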

Numba: (per-method) JIT for Python-Numpy code

  • Sometimes not as efficient as it could be 🙁

    (usually slower than Pythran / Julia / C++)


  • Only JIT 🙁


  • Not good to optimize high-level NumPy code 🙁

Pythran: AOT compiler for modules using Python-Numpy

Transpiles Python to efficient C++

  • Good to optimize high-level NumPy code 😎

  • Extensions never use the Python interpreter (pure C++ ⇒ no GIL) 🙂

  • Can produce C++ that can be used without Python

  • Usually very efficient (sometimes faster than Julia)

    • High and low level optimizations

      (Python optimizations and C++ compilation)

    • SIMD 🤩 (with xsimd)

    • Understand OpenMP instructions 🤗 !

  • Can use and make PyCapsules (functions operating in the native world) 🙂

High level transformations

In [11]:
# computation of value ranges
print_optimized("""
def f(x):
    y = 1 if x else 2
    return y == 3
""")
def f(x):
    return 0

In [12]:
# inlining
print_optimized("""
def foo(a):
    return  a + 1
def bar(b, c):
    return foo(b), foo(2 * c)
""")
def foo(a):
    return a + 1


def bar(b, c):
    return ((b + 1), ((2 * c) + 1))

In [13]:
# unroll loops
print_optimized("""
def foo():
    ret = 0
    for i in range(1, 3):
        for j in range(1, 4):
            ret += i * j
    return ret
""")
def foo():
    ret = 0
    ret += 1
    ret += 2
    ret += 3
    ret += 2
    ret += 4
    ret += 6
    return ret

In [14]:
# constant propagation
print_optimized("""
def fib(n):
    return n if n< 2 else fib(n-1) + fib(n-2)
    
def bar(): 
    return [fib(i) for i in [1, 2, 8, 20]]
""")
import functools as __pythran_import_functools


def fib(n):
    return n if (n < 2) else (fib((n - 1)) + fib((n - 2)))


def bar():
    return [1, 1, 21, 6765]


def bar_lambda0(i):
    return fib(i)

In [15]:
# advanced transformations
print_optimized("""
import numpy as np
def wsum(v, w, x, y, z):
    return sum(np.array([v, w, x, y, z]) * (.1, .2, .3, .2, .1))
""")
import numpy as __pythran_import_numpy


def wsum(v, w, x, y, z):
    return __builtin__.sum(
        ((v * 0.1), (w * 0.2), (x * 0.3), (y * 0.2), (z * 0.1))
    )

Pythran: AOT compiler for modules using Python-Numpy

  • Compile only full modules (⇒ refactoring needed 🙁)
  • Only "nopython" mode

    • limited to a subset of Python

      • only homogeneous list / dict 🤷‍♀️
      • no methods (of classes) 😢 and no user-defined classes
    • limited to few extension packages (Numpy + bits of Scipy)

    • pythranized functions can't call Python functions

  • No JIT: need types (written manually in comments)

  • Lengthy ⌛️ and memory-intensive compilations

  • Debugging 🐜 Pythran requires C++ skills!

  • No GPU (maybe with OpenMP 4?)

  • Intel compilers unable to compile Pythran C++11 👎
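As a sketch of the workflow, a Pythran module is plain Python plus a type-comment export (the module and function names here are illustrative); without compilation it still runs as ordinary Python:

```python
# mymod.py -- compile with `pythran mymod.py` to get a native extension
# importable under the same name; imported as-is, it is plain Python.

# pythran export laplace_coeff(float, float)

def laplace_coeff(dx, dy):
    # toy numerical kernel in the Python-Numpy subset Pythran accepts
    return 2.0 * (1.0 / dx**2 + 1.0 / dy**2)

print(laplace_coeff(1.0, 1.0))
```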

First conclusions

  • Python great language & ecosystem for sciences & data
  • Performance issues, especially for crunching numbers 🔢

    ⇒ need to accelerate the "numerical kernels"

  • Many good accelerators and compilers for Python-Numpy code

    • All have pros and cons!

    ⇒ We shouldn't have to write specialized code for one accelerator!

  • Other languages don't replace Python for sciences

    • Modern C++ is great and very complementary 💑 with Python

    • Julia is interesting but not the heaven on earth

Make your numerical Python code fly at transonic speed 🚀 !

Transonic is landing 🛬 !

Pure Python package (>= 3.6) to easily accelerate modern Python-Numpy code with different accelerators

Work in progress! Current state: one backend based on Pythran!

  • Keep your Python-Numpy code clean and "natural" 🧘

  • Clean type annotations (🐍 3)

  • Easily mix Python code and compiled functions

  • JIT based on AOT compilers

  • Methods (of classes) and blocks of code
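A minimal sketch of this style, assuming the transonic boost decorator (with a fallback so the snippet also runs where transonic is not installed):

```python
try:
    from transonic import boost
except ImportError:
    def boost(func):
        # fallback: without transonic, just run the plain Python function
        return func

@boost
def kinetic_energy(mass: float, velocity: float):
    # clean annotated Python-Numpy code; transonic can hand it to Pythran
    return 0.5 * mass * velocity**2

print(kinetic_energy(2.0, 3.0))
```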

Transonic: examples from real-life packages

Works also well in simple scripts and IPython / Jupyter.

Transonic: how does it work?

  • AST analyses (using Beniget, no import at compilation time)
In [24]:
# abstract syntax tree
import ast
tree = ast.parse("great_tool = 'Beniget'")
assign = tree.body[0]
print(f"{assign.value.s} is a {assign.targets[0].id}")
Beniget is a great_tool
  • Write the (Pythran) files when needed

  • Compile the (Pythran) files when needed

  • Use the fast solutions when available

Transonic: Perspectives

  • Alternative syntax for blocks of code (with block():)

  • PyCapsules

  • Backends using Cython, Numba, Cupy

Need funding 💰 !

Pythran and Transonic are cool projects, but with no 💰! A difference compared to Numba.

Conclusions

  • Very nice and efficient scientific software can be easily built with modern Python

My personal choice / hope for HPC for humans and 🐍

  • PyPy ("abstractions for free") + Numpy accelerators used through Transonic

  • Modern C++ for more fundamental tools (with multi-language API)