Gartner® Hype Cycle™ for Data Management 2024
Read The ReportGartner® Data Management 2024
Read The ReportDiscover six fundamental data structures and their implementation in Python.
Pythonistas are well versed in the intricacies of Python data types, which come equipped with out-of-the-box built-in methods to store, manipulate, and reshape data at the rhythm of keyboard strokes.
But not many data scientists and data engineers fully unlock the potential of data structures in Python - the potential to write code that is easier to maintain, more efficient by orders of magnitude, and solves problems more elegantly.
In this article, we take an in-depth look at data structures in Python through multiple topics.
Data structures sit at the intersection between computer science and data science. They occupy the in-between world of highly performant machine code on one hand, and easy-to-comprehend abstracted functions on the other.
Data structures describe how data should be organized so that it solves a computational problem while making its storage (space) complexity and time complexity as optimized as possible.
We have touched on some data structures before - you may remember the dictionary in Python, which is highly performant (O(1) time complexity at retrieving values stored in a dictionary when given a key).
For now, though, we will dive deeper into the classic and fundamental data structures:
For each data type, we will do three things:
Before we do that, we first need to clarify a murky boundary. How, if at all, do data structures differ from data types in Python?
In traditionally favored programming languages like Java and C++, the languages draw a strong distinction between Abstract Data Types (ADT) and Data Structures.
In those strongly-typed languages, ADTs are an abstraction that saves us from writing computer code ourselves. ADTs offer groups of operations that you can perform over a data type. The Python equivalent would be the class methods over data types, such as round(float) or str.upper() or dict.keys().
On the other hand, data structures are the implementation of those ADTs. To keep things performant, strongly-typed languages build upon tried-and-tested data structures that offer the best possible methods for solving computational problems. An example of that would be the class dict() in Python, which is extremely efficient at reading operations for named key-value pairs.
But Python differs from those languages because it does not strictly distinguish between ADTs and data structures. In fact, everything is an object in Python and therefore inherits from the class that it descends from. This is why the number 1 is treated as True in Boolean logic - Boolean values are class descendants of the integer class.
So, why distinguish between data types and data structures in Python at all?
The main reason is performance - understanding how to implement a data structure can improve performance significantly. But there are other benefits to solving problems with data structures in Python...
There are multiple advantages to using data structures:
Let’s check how these benefits stem from actual data structure implementations.
An array is a compact data structure that sequentially stores basic data types (integers, floats, etc.) of information by allocating sequential memory slots. It is the simplest way to store information on a computer.
Technically, an array specifies that all array elements must be of the same data type.
Items in an array can be accessed instantly (O(1) time).
Arrays are often used in numerical computation tasks or as a supporting data structure to implement other data structures (e.g. stacks, lists, and queues). When talking about numerical operations, arrays are the linear algebraic implementation that allows for vector and matrix operations or the building blocks of many machine-learning algorithms.
Python has a numeric implementation of arrays within the array module. It allows users to specify the Python (or C) data type that the array holds at initiation. It can hold elements of floats, integers, or Unicode characters (16- or 32-bits, depending on the platform).
The Python array also comes with many in-built methods. These include append(item) for adding items to the end of the array, count(item) for counting the number of occurrences of an item within the array, insert(index, item) to insert an item at position index, remove(item) for item deletion, and more.
There are other more specialized implementations of arrays in Python, such as bytes objects (immutable sequence of single bytes), bytearray objects (mutable arrays of single bytes), str objects (immutable arrays of Unicode characters declared as single quotes, double quotes, or triple quotes), and tuple objects (check out our article on this immutable sequence).
When it comes to arrays, Python does implement this data structure as part of its Standard Library, but you are better off using the NumPy library instead. It is optimized for scientific and vectorized computing.
A stack is a container of objects where item insertion and item deletion are carried out according to the LIFO principle. LIFO stands for ‘Last In First Out’ and it signifies the priority of items at insertion and removal.
You can think of a stack as a pile of plates. It would be impractical to pick up the bottom plate, so usually, we’d go for the one on top (the last one added to the stack).
The stack needs to implement at least two operations:
Of course, the stack data structure can include other elements, such as a function that returns the number of elements in a stack, or one that checks whether a stack is empty.
The stack has many uses in the world of computers:
The easiest way to implement a stack in Python is to use lists. Lists allow you to add new items to the existing stack with the append() method, and remove the last element added with the pop() method.
If you need a refresher on how to initiate a list (spoiler alert: square brackets suffice) and other methods of working with lists, have a read of our article on built-in data types.
Queues are also collections of items that hold elements, but you could say that they are the opposite of stacks. Queues collect elements according to the FIFO concept - ‘First in First Out’ - so the first element added to the queue is the first one to be deleted.
The queue needs to implement at least two operations:
You can think of a queue as a line of people waiting in front of a food truck. The first person in the queue is the first person to be served.
The queue has many pragmatic applications:
Many programmers use lists to implement queues, since you can implement the FIFO principle with slicing. However, as is noted in the Python documentation:
“[L]ists are not efficient for this purpose. While appends and pops from the end of list are fast, doing inserts or pops from the beginning of a list is slow (because all of the other elements have to be shifted by one).“
Instead, use collections.deque, which was designed to have fast appends and pops from both ends. The module collections.deque could also be used for stacks but in this case, we can focus on deque methods append() and popleft(). The former adds a new item to the right (end) of the queue, while the latter returns and removes the item to the left (beginning) of the queue.
The map data structure (also referred to as hash tables, lookup tables, hashmaps, or associative arrays) is a collection of named items. It is extremely efficient at item insertion, lookup, and item deletion.
However, this efficiency does come at a cost - it requires a lot more space than other data structures.
This is because the map is implemented as a key-value table, where each value has an associated unique key used for fast item retrieval.
Maps implement the following operations:
Maps are extremely beneficial for any applications that rely heavily on fast data read-time and are not too affected by space constraints. This involves the majority of modern web apps, which return information efficiently. The map is also very popular among data scientists during data cleaning, where checking values against a key-value baseline is used to determine if data has been corrupted in transit or during collection and transformation.
Maps are implemented as the built-in data type dict! The Python implementation requires you to pick an immutable object for a key (mutable objects such as lists cannot be transformed into reliable hashes since they could change with the hash) and it then constructs a hash table in the background.
If you need a refresher on how to initiate the dictionary in Python (spoiler alert: put the key-value pairs between curly braces) and other methods of working with dictionaries, check out our article on built-in data types.
Graphs arrange data as multiple nodes (or vertices in mathematics) connected with relationships (or edges in mathematics).
Graphs are structures used to model pairwise relations between objects. As an example, the node “Customer John Smith” could have a relation of “bought” with the node “Shampoo ‘Amazing Hair’”.
Graphs are extensively used to model other networks that naturally occur in real life, such as social media connections, semantic linguistic analysis, website traffic and interlinkedness, the topology of atoms, etc. (and the list really goes on).
This is the most flexible data structure of all because one could construct every other data structure using graphs.
A simple way to implement a graph is with a Python dictionary, whereby the direction of a relationship goes from the key (source node) to the value (target node). This might be cumbersome, but it is rather efficient.
A more intuitive approach (and one widely used by data scientists and network engineers) is to use the NetworkX library. This library allows you to easily add and remove both nodes and edges and comes with a lot of out-of-the-box methods, such as ones for finding the shortest path between nodes, or measures of associativity and centrality within the graph.
A set is an unordered collection of unique (non-duplicated) items. It is a data structure that mimics mathematical sets, so it also comes with mathematical operations, such as membership checks (is an item part of a set? A subset of another set? And so on...), and between-set operations (unions, intersections, etc.).
Sets are used for extremely efficient set operations. They are often used for validating data (e.g. as part of data cleaning or even client-facing applications, such as checking if the client credentials are in the database), finding differences between two datasets, joining unique values together (e.g. when constructing new datasets), and more.
Sets have a native implementation in Python with the set object. Using the set object’s methods enables efficient and optimized out-of-the-box set operations with method calls.
In this tutorial, you have learned:
What’s next?
Now it’s time to put your new-found knowledge to the test. Use the data structures to improve existing code or write new high-performance scripts.