How to Implement Your Own Data Structure in Python
Python provides full-fledged support for implementing your own data structure using classes and custom operators. In this tutorial you will implement a custom pipeline data structure that can perform arbitrary operations on its data. We will use Python 3.
The Pipeline Data Structure
The pipeline data structure is interesting because it is very flexible. It consists of a list of arbitrary functions that can be applied to a collection of objects and produce a list of results. I will take advantage of Python's extensibility and use the pipe character ("|") to construct the pipeline.
Before diving into all the details, let's see a very simple pipeline in action:
x = range(5) | Pipeline() | double | Ω print(x) [0, 2, 4, 6, 8]
What's going on here? Let's break it down step by step. The first element
range(5) creates a list of integers [0, 1, 2, 3, 4]. The integers are fed into an empty pipeline designated by
Pipeline(). Then a "double" function is added to the pipeline, and finally the cool
Ω function terminates the pipeline and causes it to evaluate itself.
The evaluation consists of taking the input and applying all the functions in the pipeline (in this case just the double function). Finally, we store the result in a variable called x and print it.
Python supports classes and has a very sophisticated object-oriented model including multiple inheritance, mixins, and dynamic overloading. An
__init__() function serves as a constructor that creates new instances. Python also supports an advanced meta-programming model, which we will not get into in this article.
Here is a simple class that has an
__init__() constructor that takes an optional argument
x (defaults to 5) and stores it in a
self.x attribute. It also has a
foo() method that returns the
self.x attribute multiplied by 3:
class A: def __init__(self, x=5): self.x = x def foo(self): return self.x * 3
Here is how to instantiate it with and without an explicit x argument:
>>> a = A(2) >>> print(a.foo()) 6 a = A() print(a.foo()) 15
With Python, you can use custom operators for your classes for nicer syntax. There are special methods known as "dunder" methods. The "dunder" means "double underscore". These methods like "__eq__", "__gt__" and "__or__" allow you to use operators like "==", ">" and "|" with your class instances (objects). Let's see how they work with the A class.
If you try to compare two different instances of A to each other, the result will always be False regardless of the value of x:
>>> print(A() == A()) False
This is because Python compares the memory addresses of objects by default. Let's say we want to compare the value of x. We can add a special "__eq__" operator that takes two arguments, "self" and "other", and compares their x attribute:
def __eq__(self, other): return self.x == other.x
>>> print(A() == A()) True >>> print(A(4) == A(6)) False
Implementing the Pipeline as a Python Class
Now that we've covered the basics of classes and custom operators in Python, let's use it to implement our pipeline. The
__init__() constructor takes three arguments: functions, input, and terminals. The "functions" argument is one or more functions. These functions are the stages in the pipeline that operate on the input data.
The "input" argument is the list of objects that the pipeline will operate on. Each item of the input will be processed by all the pipeline functions. The "terminals" argument is a list of functions, and when one of them is encountered the pipeline evaluates itself and returns the result. The terminals are by default just the print function (in Python 3, "print" is a function).
Note that inside the constructor, a mysterious "Ω" is added to the terminals. I'll explain that next.
The Pipeline Constructor
Here is the class definition and the
class Pipeline: def __init__(self, functions=(), input=(), terminals=(print,)): if hasattr(functions, '__call__'): self.functions = [functions] else: self.functions = list(functions) self.input = input self.terminals = [Ω] + list(terminals)
Python 3 fully supports Unicode in identifier names. This means we can use cool symbols like "Ω" for variable and function names. Here, I declared an identity function called "Ω", which serves as a terminal function:
Ω = lambda x: x
I could have used the traditional syntax too:
def Ω(x): return x
The "__or__" and "__ror__" Operators
Here comes the core of the Pipeline class. In order to use the "|" (pipe symbol), we need to override a couple of operators. The "|" symbol is used by Python for bitwise or of integers. In our case, we want to override it to implement chaining of functions as well as feeding the input at the beginning of the pipeline. Those are two separate operations.
The "__ror__" operator is invoked when the second operand is a Pipeline instance as long as the first operand is not. It considers the first operand as the input and stores it in the
self.input attribute, and returns the Pipeline instance back (the self). This allows the chaining of more functions later.
def __ror__(self, input): self.input = input return self
Here is an example where the
__ror__() operator would be invoked:
'hello there' | Pipeline()
The "__or__" operator is invoked when the first operand is a Pipeline (even if the second operand is also a Pipeline). It accepts the operand to be a callable function and it asserts that the "func" operand is indeed callable.
Then, it appends the function to the
self.functions attribute and checks if the function is one of the terminal functions. If it is a terminal then the whole pipeline is evaluated and the result is returned. If it's not a terminal, the pipeline itself is returned.
def __or__(self, func): assert(hasattr(func, '__call__')) self.functions.append(func) if func in self.terminals: return self.eval() return self
Evaluating the Pipeline
As you add more and more non-terminal functions to the pipeline, nothing happens. The actual evaluation is deferred until the
eval() method is called. This can happen either by adding a terminal function to the pipeline or by calling
The evaluation consists of iterating over all the functions in the pipeline (including the terminal function if there is one) and running them in order on the output of the previous function. The first function in the pipeline receives an input element.
def eval(self): result =  for x in self.input: for f in self.functions: x = f(x) result.append(x) return result
Using Pipeline Effectively
One of the best ways to use a pipeline is to apply it to multiple sets of input. In the following example, a pipeline with no inputs and no terminal functions is defined. It has two functions: the infamous
double function we defined earlier and the standard
Then, we provide it three different inputs. In the inner loop, we add the
Ω terminal function when we invoke it to collect the results before printing them:
p = Pipeline() | double | math.floor for input in ((0.5, 1.2, 3.1), (11.5, 21.2, -6.7, 34.7), (5, 8, 10.9)): result = input | p | Ω print(result) [1, 2, 6] [23, 42, -14, 69] [10, 16, 21]
You could use the
keep_palindromes = lambda x: (p for p in x if p[::-1] == p) keep_longer_than_3 = lambda x: (p for p in x if len(p) > 3) p = Pipeline() | keep_palindromes | keep_longer_than_3 | list (('aba', 'abba', 'abcdef'),) | p | print ['abba']
There are a few improvements that can make the pipeline more useful:
- Add streaming so it can work on infinite streams of objects (e.g. reading from files or network events).
- Provide an evaluation mode where the entire input is provided as a single object to avoid the cumbersome workaround of providing a collection of one item.
- Add various useful pipeline functions.
Python is a very expressive language and is well equipped for designing your own data structure and custom types. The ability to override standard operators is very powerful when the semantics lend themselves to such notation. For example, the pipe symbol ("|") is very natural for a pipeline.
A lot of Python developers enjoy Python's built-in data structures like tuples, lists, and dictionaries. However, designing and implementing your own data structure can make your system simpler and easier to work with by elevating the level of abstraction and hiding internal details from users. Give it a try.
Source: Tuts Plus