High performance Python techniques and paradigms

Python is a language that I find is easy to learn, but difficult to master. These notes are part of my ongoing task of becoming as proficient at this language as I can.

Advanced collections

Beyond lists, tuples and dicts, python also has several other built-in collections with their own uses and applications.

collections.ChainMap

Useful for combining multiple dictionary mappings. When searching for a variable in the current scope, python first searches locals(), then globals(), and finally builtins. Commonly, the search pattern could be implemented

import builtins

mappings = globals(), locals(), vars(builtins)
for mapping in mappings:
	for key in mapping:
		if key in mapping:
			value = mapping[key]
			break
else:	# only triggers if for loop not exited through break
	raise NameError(f'name {key} is not defined')

However, since Python3.3, a more elegant solution is to use collections.ChainMap

import builtins
import collections

mappings = collections.ChainMap(globals(), locals(), vars(builtins))
value = mappings[key]

collections.Counter

The Counter is an excellent way of keeping track of the occurrences of an element

import collections

counter = collections.Counter('abbcd')
for k in 'abbcd':
	print(f'Count for {k}: {counter[k]}')

It uses heapq implementations for obtaining the most common elements, and as such scales beautifully even with large number of elements

import math
import collections

counter = collections.Counter()
for i in range(100000):
	counter[math.sqrt(i) // 25] += 1

for key, count in counter.most_common(5):
	print(f'{key} : {count}')

It also is able to use more complex operations, such as addition, subtraction, intersection, and unions, similar to set

import collections

def print_counter(expression, counter):
	sorted_characters = sorted(counter.elements())
	print(expression, ''.join(sorted_characters))

eggs = collections.Counter('eggs')
spam = collections.Counter('spam')

print_counter('eggs:', eggs)	
# eggs: eggs
print_counter('eggs & spam:', eggs & spam)
# eggs & spam: s
print_counter('eggs - smap:', eggs - spam)
# eggs - smap: egg
print_counter('eggs + spam:', eggs + spam)
# eggs + spam: aeggmpss
print_counter('eggs | spam:', eggs | spam)
# eggs | spam: aeggmps

collections.deque

Double Ended Queue, or deque, is a very low-level collection, equivalent to a double linked list; every item in the list points to the next and the previous, with the list reference pointing to both the first and last elements. Thus, adding and subtracting elements from either side of the deque is an O(1) operation

import collections

queue = collections.deque()
queue.append(1)
queue.append(2)

print(queue)		# deque([1, 2])
queue.pop()			# 2
queue.popleft()		# 1
queue.popleft()		# IndexError: pop from an empty deque

The deque also implements maxlen as a kwarg. Consider the example

import collections

looper = collections.deque(maxlen=2)
for i in range(5):
	looper.append(i)
	print(looper)

# deque([0], maxlen=2)
# deque([0, 1], maxlen=2)
# deque([1, 2], maxlen=2)
# deque([2, 3], maxlen=2)
# deque([3, 4], maxlen=2)

The threading queue.Queue wraps a deque but implements locks and checks to make it threadsafe. Similar, asyncio.Queue is specialized for asynchronous use, and multiprocessing.Queue for multiprocess operations (which I will write notes for later).

collections.defaultdict

To put it simply, a defaultdict is a dictionary with a default value. The benefit of such an item prevents having to check if a key exists before accessing it, and has great application in implementing in-memory databases. As an illustration, consider a graph structure

nodes = [
	('a', 'b'),
	('a', 'c'),
	('b', 'a'),
	('b', 'd'),
	('c', 'a'),
	('d', 'a'),
	('d', 'b'),
	('d', 'c')	
]

graph = dict()
for from_, to in nodes:
	if from_ not in graph:
		graph[from_] = []
	graph[from_].append(to)

import pprint
pprint.pprint(graph)

# {'a': ['b', 'c'], 'b': ['a', 'd'], 'c': ['a'], 'd': ['a', 'b', 'c']}

This could be implemented more elegantly

import collections

graph = collections.defaultdict(list)
for from_, to in nodes:
	graph[from_].append(to)

pprint.pprint(graph)

# defaultdict(<class 'list'>,
#             {'a': ['b', 'c'],
#              'b': ['a', 'd'],
#              'c': ['a'],
#              'd': ['a', 'b', 'c']})

It’s trivial to see how a Counter could be implemented with defaultdict, but the version included in collections is far more extensive. Note that the default value must be a callable object (e.g. int). This means trees could be very easily defined as

import collections

def tree():
	return collections.defaultdict(tree)

and used

import json

colours = tree()
colours['other']['black'] = 0x000000
colours['other']['white'] = 0xFFFFFF
colours['primary']['red'] = 0xFF0000
colours['primary']['red'] = 0x00FF00
colours['primary']['blue'] = 0x0000FF

print(json.dumps(colours, sort_keys=True, indent=4))

# {
#     "other": {
#         "black": 0,
#         "white": 16777215
#     },
#     "primary": {
#         "blue": 255,
#         "red": 65280
#     }
# }

Such a data structure is able to generate itself recursively.

collections.namedtuple

This collection is essentially a tuple with field names; it lends itself to describing coordinates well

import collections

Point = collections.namedtuple('Point', ['x', 'y', 'z'])
point_a = Point(1, 2, 3)
print(point_a)
# Point(x=1, y=2, z=3)

point_b = Point(x=4, z=5, y=6)
print(point_b)
# Point(x=4, y=6, z=5)

Properties can be accessed by index or by name

print(point_a.x == point_a[0])
# True

collections.enum

Essentially a Python implementation of Clang familiar enum, allowing constants without magic numbers. For example

import enum

class Colour(enum.Enum):
	red = 1
	green = 3
	blue = 2

print(Colour.red == Colour['red'])	# True
print(Colour.red == Colour(1))		# True

print(Colour.red == 1)				# False
print(Colour.red.value == 1)		# True

As you might expect, enums are iterable

for colour in Colour:
	print(colour)

# Colour.red
# Colour.green
# Colour.blue

and have value as well as text representation

cols = dict()
cols[Colour.green] = 0x00FF00

print(cols)

Derived classes also support type inheritance, such as

import enum

class Simple(enum.Enum):
	PROP = 'prop'

print(Simple.PROP == 'prop')	# False

class Derived(str, enum.Enum):
	PROP = 'prop'

print(Derived.PROP == 'prop')	# True

collections.OrderedDict

To put simply, a dictionary where the insertion order is important. By contrast, dict will return keys in the order of hash, whereas OrderedDict returns the keys in order of insertion

import collections
spam = collections.OrderedDict()
spam['a'] = 1
spam['c'] = 3
spam['b'] = 2
print(spam)
# OrderedDict([('a', 1), ('c', 3), ('b', 2)])

print(collections.OrderedDict(sorted(spam.items())))
# OrderedDict([('a', 1), ('b', 2), ('c', 3)])

It is implemented by a dict in combination with a doubly linked list for keeping track of the order. Another dict is used to keep track of reverse relation. Within this, set and get are O(1) but the object requires more memory than conventional dictionaries, and as such don’t scale very efficiently.

heapq

Essentially an ordered list, and very useful for creating priority queues, able to make the smallest (or largest) item in the list easily obtainable

import heapq

heap = [1, 3, 4, 7, 2, 4, 3]
heapq.heapify(heap)
print(heap)
# [1, 2, 3, 7, 3, 4, 4]

A heap is a binary tree for which the parent nodes has a value less than or equal to any of its children. We can visualize it by iterating through our object

while heap:
	heapq.heappop(heap), heap

# 1 [2, 3, 3, 7, 4, 4]
# 2 [3, 3, 4, 7, 4]
# 3 [3, 4, 4, 7]
# 3 [4, 4, 7]
# 4 [4, 7]
# 4 [7]
# 7 []

bisect

A sorted list; where heapq makes obtaining the smallest (or largest) item easy, bisect inserts items so that the overall list stays sorted. As such, finding in bisect is fast, whereas adding and removing in heapq is fast. Adding new elements to a bisect is an O(n) operation, and creating a sorted list using bisect takes O(n^2), compared to e.g. heapq with O(n log(n)). The primary use of bisect is then to extend an already ordered structure, as opposed to creating a new one

import bisect

# regular sort
sorted_list = []
sorted_list.append(5)
sorted_list.append(3)
sorted_list.append(1)
sorted_list.append(2)	# each operation so far is O(1)
sorted_list.sort()		# O(n * log(n)) = O(8)

print(sorted_list)
# [1, 2, 3, 5]

# bisect
sorted_list = []
bisect.insort(sorted_list, 5)	# O(n) = O(1)
bisect.insort(sorted_list, 3)	# O(2)
bisect.insort(sorted_list, 1)	# ...
bisect.insort(sorted_list, 2)	# O(4)

print(sorted_list)
# [1, 2, 3, 5]

Searching in such a data structure is likewise very fast; e.g. bisect_left finds the position at which a number is supposed to be

import bisect

sorted_list = [1, 2, 3, 5]

def contains(sorted_list, value):
	i = bisect.bisect_left(sorted_list, value)
	return i < len(sorted_list) and sorted_list[i] == value

contains(sorted_list, 2)	# True
contains(sorted_list, 4)	# False

The implementation details for such a search is a binary search algorithm.

Some notes on comprehensions

Pythonic list and dictionary comprehension are powerful ways of generating complex data structures and algorithms, but at the cost of readability. They are analogous to simple for loops

some_list = []
for i in range(10):
	some_list.append(i)

comp_list = [i for i in range(10)]

print(some_list == comp_list)		# True

They can however very quickly become counter intuitive. For example, filtering can be applied

[i for i in range(100) if i % 2 == 1]	# odd numbers

functions may be executed within them, although complications arise if used with PRNGs

import random
[random.random() for _ in range(10) if random.random() >= 0.5]		# may yield values less than 0.5
[x for x in [random.random() for _ in range(10)] if x >= 0.5]		# may not produce 10 items
[x for _ in range(10) for x in [random.random()] if x >= 0.5]		# obtains desired result

The last line in the above extract is analogous to nested for loops. For example

items = []
for i in range(5):
	for j in range(3, 5):
		results.append((i, j))

may be written with list comprehension as

[(i, j) for i in range(5) for j in range(3, 5)]

The same syntax exists for dictionary comprehension too

{i : i**2 for i in range(10) if i % 2 == 1}		# odd squares indexed by their root

The two may be used in one another also

import random
[{i : x for x in range(i)} for i in range(10)]
# [{}, {1: 0}, {2: 1}, {3: 2}, {4: 3}, {5: 4}, {6: 5}, {7: 6}, {8: 7}, {9: 8}]

God knows why you would want that, but you have it.

Finally, you also have set comprehension

{i * j for i in range(5) for j in range(6)}
# {0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 20}

Some peculiarities of lambda functions

Lambda functions can be nested, such as to provide quick methods for providing callback functions or even decorators. For example

square_result = lambda func: lambda val: func(val)**2

def func(x): return 4 * x
func(3)			# 12

@square_result
def func(x): return 4 * x
func(3)			# 144

Lambda functions can also be called recursively

factorial = lambda x: 1 if x == 1 else x * factorial(x-1)
factorial(4)	# 24

Finally, if you wanted to be a real snide prick, you can implement quicksort in a few lines using the famous Y from lambda-calculus

Y = lambda f: lambda *args: f(Y(f))(*args)

quicksort = Y(lambda f: lambda x: (
		f([i for i in x if i < x[0]]) +
		[y for y in x if x[0] == y] + 
		f([i for i in x if i > x[0]])
	) if x else []
)

quicksort([1, 7, 5, 6, 2, 3, 9, 7, 8, 2, 4, 5, 2, 1, 6, 2])
# [1, 1, 2, 2, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 9]

Using itertools