Iterators and Generators

Posted on Wed 25 October 2023 in SWE

Hey there, Pythonistas! 🐍

In the world of Python programming, understanding how to manage and work with data efficiently is a crucial skill. Iterators and generators, although initially intimidating for some, are actually quite simple to understand and use. In this post, we'll go over the basics of both and how to use them in your code.

What are iterators?

An iterator, in essence, is any Python object you can loop over one item at a time. Most built-in containers in Python like list, tuple, and string are iterables, but they are not iterators themselves. For an object to be an iterator, it must implement the __iter__() and __next__() methods: __iter__() returns the iterator object itself (which lets an iterator be used anywhere an iterable is expected), and __next__() returns the next item in the sequence.

Here's how an iterator works in simple terms:

  1. It fetches items one by one.
  2. It maintains the state of iteration, meaning you can pause and resume iteration.
  3. It raises a StopIteration exception when there are no items left to be returned.

To create an iterator object from an iterable, you can use the iter() function, then call next() to retrieve items from it one at a time.

my_list = [1, 2, 3, 4]
iter_obj = iter(my_list)
print(next(iter_obj))  # 1
print(next(iter_obj))  # 2
print(next(iter_obj))  # 3
print(next(iter_obj))  # 4
print(next(iter_obj))  # Raises StopIteration exception
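
To see the protocol in action, here's a minimal sketch of a hand-rolled iterator: a hypothetical CountDown class (not part of the standard library) that counts down from a given number.

class CountDown:
    """Iterator that counts down from `start` to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # Returning self is what makes this object its own iterator.
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signal that iteration is finished
        value = self.current
        self.current -= 1
        return value

for number in CountDown(3):
    print(number)  # 3, 2, 1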

What are generators?

Generators are a special kind of iterator that produces a sequence of values on-the-fly. They are easy to implement and can be incredibly useful when working with large datasets or continuous data streams. Instead of implementing the methods required by an iterator (__iter__ and __next__), a generator only needs a function with one or more yield statements. When the function is called, it returns a generator object but does not start executing its body. Once the generator's __next__() method is called, it runs until it reaches a yield statement, returns the yielded value, and pauses there until the next call.

The primary advantages of using generators are:

  1. Simplified code: No need to implement an iterator's methods.
  2. Memory efficiency: Generators are lazy and produce values on-the-fly, not storing them in memory like lists.

Here's an example:

def simple_generator():
    yield 1
    yield 2
    yield 3

gen = simple_generator()
print(next(gen))  # 1
print(next(gen))  # 2
print(next(gen))  # 3
print(next(gen))  # Raises StopIteration exception
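
In practice, you rarely call next() by hand. A for loop calls it for you and silently catches the StopIteration exception:

for value in simple_generator():
    print(value)  # 1, 2, 3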

Generator Expressions

Generator expressions are a concise way to create generators. They look similar to list comprehensions but use parentheses instead of square brackets. Example:

squares = (x*x for x in range(4))
print(next(squares))  # 0
print(next(squares))  # 1
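
A common idiom is to pass a generator expression directly to a function that consumes an iterable, such as sum(), so the full sequence is never built in memory:

total = sum(x*x for x in range(1_000_000))  # values are produced one at a time
print(total)  # 333332833333500000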

Why Use Generators and Iterators?

  1. Memory Efficiency: Iterators and generators allow processing large datasets without loading everything into memory.
  2. On-the-fly computation: They compute values on the go, reducing initial wait time.
  3. Cleaner Code: Generators, especially, can make code more readable by abstracting away the iteration mechanics.
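
To make the memory point concrete, you can compare a list against an equivalent generator with sys.getsizeof (exact numbers vary by Python version and platform):

import sys

numbers_list = [x for x in range(1_000_000)]  # materializes every value up front
numbers_gen = (x for x in range(1_000_000))   # stores only its current state

print(sys.getsizeof(numbers_list))  # millions of bytes
print(sys.getsizeof(numbers_gen))   # around a couple hundred bytes, regardless of range size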

Real-World Use Case - Processing a Large File

Here you can find a file (zipped) called organizations-2000000.csv that contains 2 million rows of data. We will use this file, but imagine that it's much larger and can't fit into memory all at once. Our goal is to pull out all organizations that have at least 9,900 employees and save them to a new file. So let's see how we can do this using generators.

import csv

def read_large_csv_file(file_path):
    with open(file_path, 'r', newline='', encoding='utf-8') as file:
        csvreader = csv.reader(file)
        next(csvreader)  # skip the header row
        for line in csvreader:
            yield line

We defined a generator function called read_large_csv_file that takes a file path as an argument and yields the file's rows one at a time. Notice that we're using the yield keyword instead of return. This is what makes the function a generator. Now, let's use it to keep only the organizations that have at least 9,900 employees.

# SOME_LARGE_FILE, OUTPUT_FILE and OUTPUT_HEADER are constants defined
# in the full script linked below.
if __name__ == "__main__":
    organizations = read_large_csv_file(SOME_LARGE_FILE)  # generator object
    with open(OUTPUT_FILE, "wt", encoding="utf-8") as output:
        output.write(OUTPUT_HEADER)
        for organization in organizations:
            if int(organization[8]) >= 9900:  # index 8 is the employee count
                output.write(f"{','.join(organization)}\n")

We used the read_large_csv_file function to create a generator object called organizations. Then, we opened a new file called output.csv and wrote the header to it. Finally, we iterated over the organizations generator object and wrote the filtered organizations to the output file one by one. This way, we can process large files without loading them all into memory at once. Pretty cool, right? 😎 Try it out yourself!
Just remember that generators can only be iterated over once, so if you need to iterate over the same data multiple times, you'll need to create a new generator object each time.
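
Here's a quick illustration of that one-shot behavior:

rows = (x for x in range(3))
print(list(rows))  # [0, 1, 2]
print(list(rows))  # [] - the generator is already exhausted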

Full code can be found here.

Conclusion

Generators and iterators are powerful tools in Python's arsenal, allowing for efficient and readable code when dealing with data streams or large datasets. Whether you're processing large files or working with continuous data streams, understanding these concepts will undoubtedly boost your Python prowess.

If you would like to get more content like this directly to your inbox, consider subscribing to my newsletter.

Happy coding, and thanks for reading! 🚀🐍