Clean Code and The Art of Exception Handling

Exceptions are as old as programming itself. Back in the days when programming was done in hardware, or via low-level programming languages, exceptions were used to alter the flow of the program, and to avoid hardware failures. Today, Wikipedia defines exceptions as:

anomalous or exceptional conditions requiring special processing – often changing the normal flow of program execution…

And that handling them requires:

specialized programming language constructs or computer hardware mechanisms.

So, exceptions require special treatment, and an unhandled exception may cause unexpected behavior. The results are often spectacular. In 1996, the famous Ariane 5 rocket launch failure was attributed to an unhandled overflow exception. History’s Worst Software Bugs contains some other bugs that could be attributed to unhandled or mishandled exceptions.

Over time, these errors, and countless others (that were, perhaps, not as dramatic, but still catastrophic for those involved) contributed to the impression that exceptions are bad.

But exceptions are a fundamental element of modern programming; they exist to make our software better. Rather than fearing exceptions, we should embrace them and learn how to benefit from them. In this article, we will discuss how to manage exceptions elegantly, and use them to write clean code that is more maintainable.

Exception Handling: It’s a Good Thing

With the rise of object-oriented programming (OOP), exception support has become a crucial element of modern programming languages. A robust exception handling system is built into most languages nowadays. For example, Ruby provides the following typical pattern:

begin
  do_something_that_might_not_work!
rescue SpecificError => e
  do_some_specific_error_clean_up
  retry if some_condition_met?
ensure
  this_will_always_be_executed
end

There is nothing wrong with the previous code. But overusing these patterns will cause code smells, and won’t necessarily be beneficial. Likewise, misusing them can actually do a lot of harm to your code base, making it brittle, or obfuscating the cause of errors.

The stigma surrounding exceptions often makes programmers feel at a loss. It’s a fact of life that exceptions can’t be avoided, but we are often taught they must be dealt with swiftly and decisively. As we will see, this is not necessarily true. Rather, we should learn the art of handling exceptions gracefully, making them harmonious with the rest of our code.

Following are some recommended practices that will help you embrace exceptions and use them to keep your code maintainable, extensible, and readable:

  • maintainability: Allows us to easily find and fix new bugs, without the fear of breaking current functionality, introducing further bugs, or having to abandon the code altogether due to increased complexity over time.
  • extensibility: Allows us to easily add to our code base, implementing new or changed requirements without breaking existing functionality. Extensibility provides flexibility, and enables a high level of reusability for our code base.
  • readability: Allows us to easily read the code and discover its purpose without spending too much time digging. This is critical for efficiently discovering bugs and untested code.

These elements are the main factors of what we might call cleanliness or quality, which is not a direct measure itself, but instead is the combined effect of the previous points, as demonstrated in this comic:

"WTFs/m" by Thom Holwerda, OSNews

With that said, let’s dive into these practices and see how each of them affects those three measures.

Note: We will present examples from Ruby, but all of the constructs demonstrated here have equivalents in the most common OOP languages.

Always create your own ApplicationError hierarchy

Most languages come with a variety of exception classes, organized in an inheritance hierarchy, like any other OOP class. To preserve the readability, maintainability, and extensibility of our code, it’s a good idea to create our own subtree of application-specific exceptions that extend the base exception class. Investing some time in logically structuring this hierarchy can be extremely beneficial. For example:

class ApplicationError < StandardError; end
# Validation Errors
class ValidationError < ApplicationError; end
class RequiredFieldError < ValidationError; end
class UniqueFieldError < ValidationError; end
# HTTP 4XX Response Errors
class ResponseError < ApplicationError; end
class BadRequestError < ResponseError; end
class UnauthorizedError < ResponseError; end
# ...

Example of an application exception hierarchy.

Having an extensible, comprehensive exceptions package for our application makes handling these application-specific situations much easier. For example, we can decide which exceptions to handle in a more natural way. This not only boosts the readability of our code, but also increases the maintainability of our applications and libraries (gems).

From the readability perspective, it’s much easier to read:

rescue ValidationError => e

Than to read:

rescue RequiredFieldError, UniqueFieldError, ... => e

From the maintainability perspective, say, for example, we are implementing a JSON API, and we have defined our own ClientError with several subtypes, to be used when a client sends a bad request. If any one of these is raised, the application should render the JSON representation of the error in its response. It will be easier to fix, or add logic, to a single block that handles ClientErrors rather than looping over each possible client error and implementing the same handler code for each. In terms of extensibility, if we later have to implement another type of client error, we can trust it will already be handled properly here.

Moreover, this does not prevent us from implementing additional special handling for specific client errors earlier in the call stack, or altering the same exception object along the way:

# app/controller/pseudo_controller.rb
def authenticate_user!
  fail AuthenticationError if token_invalid? || token_expired?
  User.find_by(authentication_token: token)
rescue AuthenticationError => e
  report_suspicious_activity if token_invalid?
  raise e
end

def show
  authenticate_user!
  show_private_stuff!(params[:id])
rescue ClientError => e
  render_error(e)
end

As you can see, raising this specific exception didn’t prevent us from being able to handle it on different levels, altering it, re-raising it, and allowing the parent class handler to resolve it.

Two things to note here:

  • Not all languages support raising exceptions from within an exception handler.
  • In most languages, raising a new exception from within a handler will cause the original exception to be lost forever, so it’s better to re-raise the same exception object (as in the above example) to avoid losing track of the original cause of the error (unless you are doing so intentionally).

Never rescue Exception

That is, never try to implement a catch-all handler for the base exception type. Rescuing or catching all exceptions wholesale is never a good idea in any language, whether it’s globally on a base application level, or in a small buried method used only once. We don’t want to rescue Exception because it will obfuscate whatever really happened, damaging both maintainability and extensibility. We can waste a huge amount of time debugging what the actual problem is, when it could be as simple as a syntax error:

# main.rb
def bad_example
  i_might_raise_exception!
rescue Exception
  nah_i_will_always_be_here_for_you
end

# elsewhere.rb
def i_might_raise_exception!
  retrun do_a_lot_of_work!
end

You might have noticed the error in the previous example; return is mistyped. Although modern editors provide some protection against this specific type of syntax error, this example illustrates how rescue Exception does harm to our code. At no point is the actual type of the exception (in this case a NoMethodError) addressed, nor is it ever exposed to the developer, which may cause us to waste a lot of time running in circles.

Never rescue more exceptions than you need to

The previous point is a specific case of this rule: We should always be careful not to over-generalize our exception handlers. The reasons are the same; whenever we rescue more exceptions than we should, we end up hiding parts of the application logic from higher levels of the application, not to mention suppressing the developer’s ability to handle the exception themselves. This severely affects the extensibility and maintainability of the code.

If we do attempt to handle different exception subtypes in the same handler, we introduce fat code blocks that have too many responsibilities. For example, if we are building a library that consumes a remote API, handling a MethodNotAllowedError (HTTP 405) is usually different from handling an UnauthorizedError (HTTP 401), even though they are both ResponseErrors.

As we will see, often there exists a different part of the application that would be better suited to handle specific exceptions in a more DRY way.

So, define the single responsibility of your class or method, and handle the bare minimum of exceptions that satisfy this responsibility requirement. For example, if a method is responsible for getting stock info from a remote API, then it should handle exceptions that arise from getting that info only, and leave the handling of other errors to a different method designed specifically for those responsibilities:

def get_info
  begin
    response = HTTP.get(STOCKS_URL + "#{@symbol}/info")
    fail AuthenticationError if response.code == 401
    fail StockNotFoundError, @symbol if response.code == 404
    return JSON.parse response.body
  rescue JSON::ParserError
    retry
  end
end

Here we defined the contract for this method to only get us the info about the stock. It handles endpoint-specific errors, such as an incomplete or malformed JSON response. It doesn’t handle the case when authentication fails or expires, or if the stock doesn’t exist. These are someone else’s responsibility, and are explicitly passed up the call stack where there should be a better place to handle these errors in a DRY way.

Resist the urge to handle exceptions immediately

This is the complement to the last point. An exception can be handled at any point in the call stack, and any point in the class hierarchy, so knowing exactly where to handle it can be mystifying. To solve this conundrum, many developers opt to handle any exception as soon as it arises, but investing time in thinking this through will usually result in finding a more appropriate place to handle specific exceptions.

One common pattern that we see in Rails applications (especially those that expose JSON-only APIs) is the following controller method:

# app/controllers/client_controller.rb
def create
  @client = Client.new(params[:client])
  if @client.save
    render json: @client
  else
    render json: @client.errors
  end
end

(Note that although this is not technically an exception handler, functionally, it serves the same purpose, since @client.save only returns false when it encounters an exception.)

Read the full post on the Toptal Engineering blog.



Ruby vs. Python

Toptal freelance experts Damir Zekic and Amar Sahinovic argue the merits of Ruby versus Python, covering everything from speed to performance. Listen to the podcast and weigh in by voting on the superior language and commenting in the thread below. Listen to them debate here.



How to Create a Simple Python WebSocket Server Using Tornado

With the increase in popularity of real-time web applications, WebSockets have become a key technology in their implementation. The days where you had to constantly press the reload button to receive updates from the server are long gone. Web applications that want to provide real-time updates no longer have to poll the server for changes - instead, servers push changes down the stream as they happen. Robust web frameworks have begun supporting WebSockets out of the box. Ruby on Rails 5, for example, took it even further and added support for Action Cable.

In the world of Python, many popular web frameworks exist. Frameworks such as Django provide nearly everything necessary to build web applications, and anything it lacks can be made up for with one of the thousands of plugins available for Django. However, due to the way Python and most of its web frameworks work, handling long-lived connections can quickly become a nightmare. The threaded model and global interpreter lock are often considered to be the Achilles’ heel of Python.

But all of that has started to change. With certain new features of Python 3 and frameworks that already exist for Python, such as Tornado, handling long-lived connections is a challenge no more. Tornado provides web server capabilities in Python that are specifically useful in handling long-lived connections.

In this article, we will take a look at how a simple WebSocket server can be built in Python using Tornado. The demo application will allow us to upload a tab-separated values (TSV) file, parse it and make its contents available at a unique URL.

Tornado and WebSockets

Tornado is an asynchronous network library that specializes in event-driven networking. Since it can naturally hold tens of thousands of open connections concurrently, a server can take advantage of this and handle a lot of WebSocket connections within a single node. WebSocket is a protocol that provides full-duplex communication channels over a single TCP connection. As the socket stays open, this technique makes a web connection stateful and facilitates real-time data transfer to and from the server. With the server keeping the states of the clients, it becomes easy to implement real-time chat applications or web games based on WebSockets.

WebSockets are designed to be implemented in web browsers and servers, and are currently supported in all of the major web browsers. A connection is opened once, and messages can travel back and forth multiple times before the connection is closed.

Installing Tornado is rather simple. It is listed in PyPI and can be installed using pip or easy_install:

pip install tornado

Tornado comes with its own implementation of WebSockets. For the purposes of this article, this is pretty much all we will need.
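
Since the front-end shown later connects to a /parser/ws endpoint, here is a minimal sketch of what the server side might look like. This is not the article’s full parser.py; the handler name and port are our own assumptions, and only the URL is taken from the front-end snippet below:

import tornado.ioloop
import tornado.web
import tornado.websocket

class ParserWebSocket(tornado.websocket.WebSocketHandler):
    def open(self):
        print('WebSocket opened')

    def on_message(self, message):
        # The real app would parse the uploaded TSV here; we just acknowledge.
        self.write_message({'received_bytes': len(message)})

    def on_close(self):
        print('WebSocket closed')

app = tornado.web.Application([(r'/parser/ws', ParserWebSocket)])

if __name__ == '__main__':
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()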

WebSockets in Action

One of the advantages of using WebSocket is its stateful property. This changes the way we typically think of client-server communication. One particular use case is where the server is required to perform long, slow processes and gradually stream results back to the client.

In our example application, the user will be able to upload a file through WebSocket. For the entire lifetime of the connection, the server will retain the parsed file in-memory. Upon requests, the server can then send back parts of the file to the front-end. Furthermore, the file will be made available at a URL which can then be viewed by multiple users. If another file is uploaded at the same URL, everyone looking at it will be able to see the new file immediately.

For the front-end, we will use AngularJS. This framework and its libraries will allow us to easily handle file uploads and pagination. For everything related to WebSockets, however, we will use standard JavaScript functions.

This simple application will be broken down into three separate files:

  • parser.py: where our Tornado server with the request handlers is implemented
  • templates/index.html: the front-end HTML template
  • static/parser.js: our front-end JavaScript

Opening a WebSocket

From the front-end, a WebSocket connection can be established by instantiating a WebSocket object:

new WebSocket(WEBSOCKET_URL);

This is something we will have to do on page load. Once a WebSocket object is instantiated, handlers must be attached to handle three important events:

  • open: fired when a connection is established
  • message: fired when a message is received from the server
  • close: fired when a connection is closed

$scope.init = function() {
    $scope.ws = new WebSocket('ws://' + location.host + '/parser/ws');
    $scope.ws.binaryType = 'arraybuffer';

    $scope.ws.onopen = function() {
        console.log('Connected.');
    };

    $scope.ws.onmessage = function(evt) {
        $scope.$apply(function () {
            message = JSON.parse(evt.data);
            $scope.currentPage = parseInt(message['page_no']);
            $scope.totalRows = parseInt(message['total_number']);
            $scope.rows = message['data'];
        });
    };

    $scope.ws.onclose = function() {
        console.log('Connection is closed...');
    };
};

$scope.init();

Since these event handlers will not automatically trigger AngularJS’s $scope lifecycle, the contents of the handler functions need to be wrapped in $apply. In case you are interested, AngularJS-specific packages exist that make it easier to integrate WebSockets into AngularJS applications.

It’s worth mentioning that dropped WebSocket connections are not automatically reestablished, and will require the application to attempt reconnects when the close event handler is triggered. This is a bit beyond the scope of this article.

Selecting a File to Upload

Since we are building a single-page application using AngularJS, attempting to submit forms with files the age-old way will not work. To make things easier, we will use Danial Farid’s ng-file-upload library. With it, all we need to do to allow a user to upload a file is add a button to our front-end template with specific AngularJS directives:

<button class="btn btn-default" type="file" ngf-select="uploadFile($file, $invalidFiles)"
accept=".tsv" ngf-max-size="10MB">Select File</button>

The library, among many other things, allows us to set the acceptable file extensions and size. Clicking on this button, just like any <input type="file"> element, will open the standard file picker.

Uploading the File

When you want to transfer binary data, you can choose between an array buffer and a blob. If it is just raw data, like an image file, choose a blob and handle it properly on the server. An array buffer is a fixed-length binary buffer, and a text file like a TSV can be transferred in the format of a byte string. This code snippet shows how to upload a file in array buffer format.

$scope.uploadFile = function(file, errFiles) {
    ws = $scope.ws;
    $scope.f = file;
    $scope.errFile = errFiles && errFiles[0];
    if (file) {
        reader = new FileReader();
        rawData = new ArrayBuffer();

        reader.onload = function(evt) {
            rawData = evt.target.result;
            ws.send(rawData);
        };

        reader.readAsArrayBuffer(file);
    }
};

The ng-file-upload directive provides an uploadFile function. Here you can transform the file into an array buffer using a FileReader, and send it through the WebSocket.

Note that sending large files over WebSocket by reading them into array buffers may not be the optimal way to upload them, as it can quickly occupy too much memory, resulting in a poor experience.

Read the full article in the Toptal Engineering blog.



Data Mining for Predictive Social Network Analysis

Social networks, in one form or another, have existed since people first began to interact. Indeed, put two or more people together and you have the foundation of a social network. It is therefore no surprise that, in today’s Internet-everywhere world, online social networks have become entirely ubiquitous.

Within this world of online social networks, a particularly fascinating phenomenon of the past decade has been the explosive growth of Twitter, often described as “the SMS of the Internet”. Launched in 2006, Twitter rapidly gained global popularity and has become one of the ten most visited websites in the world. As of May 2015, Twitter boasts 302 million active users who are collectively producing 500 million Tweets per day. And these numbers are continually growing.

Read the full article in Toptal Engineering blog



To Python 3 and Back Again: Is It Worth the Switch?

Python 3 has been in existence for 7 years now, yet some still prefer to use Python 2 instead of the newer version. This is a problem, especially for neophytes who are approaching Python for the first time. I realized this at my previous workplace with colleagues in the exact same situation. Not only were they unaware of the differences between the two versions, they were not even aware of the version that they had installed.

Inevitably, different colleagues had installed different versions of the interpreter. That was a recipe for disaster had they then tried to blindly share scripts with each other.
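
If you are ever unsure which interpreter will run your script, a one-line check settles it (a trivial snippet, but it would have spared my colleagues some confusion):

# Print the version of the interpreter that is actually running.
import sys
print(sys.version)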

This wasn’t really their fault; on the contrary, a greater effort in documenting and raising awareness is needed to dispel the veil of FUD (fear, uncertainty, and doubt) that sometimes affects our choices. This post is thus meant for them, or for those who already use Python 2 but aren’t sure about moving to the next version, maybe because they tried version 3 only at the beginning, when it was less refined and library support was worse.

Read the full article in the Toptal Engineering blog


An Introduction to Mocking in Python

How to Run Unit Tests Without Testing Your Patience

More often than not, the software we write directly interacts with what we would label as “dirty” services. In layman’s terms: services that are crucial to our application, but whose interactions have intended but undesired side-effects—that is, undesired in the context of an autonomous test run.

For example: perhaps we’re writing a social app and want to test out our new ‘Post to Facebook feature’, but don’t want to actually post to Facebook every time we run our test suite.

The Python unittest library includes a subpackage named unittest.mock—or if you declare it as a dependency, simply mock—which provides extremely powerful and useful means by which to mock and stub out these undesired side-effects.

Note: mock is newly included in the standard library as of Python 3.3; prior distributions will have to use the Mock library downloadable via PyPI.
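
A common way to keep one test suite working across both cases is a small import shim (a standard idiom rather than anything specific to this article):

try:
    from unittest import mock  # Python 3.3+
except ImportError:
    import mock  # older Pythons: pip install mock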

Fear System Calls

To give you another example, and one that we’ll run with for the rest of the article, consider system calls. It’s not difficult to see that these are prime candidates for mocking: whether you’re writing a script to eject a CD drive, a web server which removes antiquated cache files from /tmp, or a socket server which binds to a TCP port, these calls all feature undesired side-effects in the context of your unit-tests.

As a developer, you care more that your library successfully called the system function for ejecting a CD (with the correct arguments, etc.) as opposed to actually experiencing your CD tray open every time a test is run. (Or worse, multiple times, as multiple tests reference the eject code during a single unit-test run!)

Likewise, keeping your unit tests efficient and performant means keeping as much “slow code” as possible out of the automated test runs, namely filesystem and network access.

For our first example, we’ll refactor a standard Python test case from original form to one using mock. We’ll demonstrate how writing a test case with mocks will make our tests smarter, faster, and able to reveal more about how the software works.
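
As a taste of what’s ahead, here is a minimal sketch in that style. The rm helper is our own stand-in, not the article’s exact code; patching os.remove lets the test assert the call was made correctly without ever touching the real filesystem:

import os
import unittest
from unittest import mock

def rm(filename):
    """The 'dirty' helper under test - it performs a real system call."""
    os.remove(filename)

class RmTestCase(unittest.TestCase):
    @mock.patch('os.remove')
    def test_rm_calls_os_remove(self, mock_remove):
        rm('any path')
        # The real os.remove never ran; we only verify the interaction.
        mock_remove.assert_called_once_with('any path')

if __name__ == '__main__':
    unittest.main()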

Read the full article in Toptal Engineering blog 



How I Made Porn 20x More Efficient with Python Video Streaming

Intro

Porn is a big industry. There aren’t many sites on the Internet that can rival the traffic of its biggest players.

And juggling this immense traffic is tough. To make things even harder, much of the content served from porn sites is made up of low-latency live video streams rather than simple static video content. But for all of the challenges involved, rarely have I read about the Python developers who take them on. So I decided to write about my own experience on the job.

What’s the problem?

A few years ago, I was working for the 26th (at the time) most visited website in the world—not just the porn industry: the world.

At the time, the site served up porn video streaming requests with the Real-Time Messaging Protocol (RTMP). More specifically, it used a Flash Media Server (FMS) solution, built by Adobe, to provide users with live streams. The basic process was as follows:

  1. The user requests access to some live stream
  2. The server replies with an RTMP session playing the desired footage

For a couple of reasons, FMS wasn’t a good choice for us, starting with its costs, which included the purchase of both:

  1. Windows licenses for every machine on which we ran FMS.
  2. ~$4k FMS-specific licenses, of which we had to purchase several hundred (and more every day) due to our scale.

All of these fees began to rack up. And costs aside, FMS was a lacking product, especially in its functionality (more on this in a bit). So I decided to scrap FMS and write my own Python RTMP parser from scratch.

In the end, I managed to make our service roughly 20x more efficient.

Getting started

There were two core problems involved: firstly, RTMP and other Adobe protocols and formats were not open (i.e., publicly available), which made them hard to work with. How can you reverse or parse files in a format about which you know nothing? Luckily, there were some reversing efforts available in the public sphere (not produced by Adobe, but rather by a group called OS Flash, now defunct) on which we based our work.

Note: Adobe later released “specifications” which contained no more information than what was already disclosed in the non-Adobe-produced reversing wiki and documents. Their (Adobe’s) specifications were of an absurdly low quality and made it near impossible to actually use their libraries. Moreover, the protocol itself seemed intentionally misleading at times. For example:

  1. They used 29-bit integers (a decoder sketch follows this list).
  2. They included protocol headers with big endian formatting everywhere—except for a specific (yet unmarked) field, which was little endian.
  3. They squeezed data into less space at the cost of computational power when transporting 9k video frames, which made little to no sense, because they were earning back bits or bytes at a time—insignificant gains for such a file size.
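
The article doesn’t say which encoding those 29-bit integers used, but assuming they are AMF3-style variable-length “U29” values, decoding one looks like this: each of the first three bytes contributes 7 bits and uses its high bit as a continuation flag, while the fourth byte, if present, contributes all 8 bits.

def read_u29(data):
    """Decode an AMF3-style variable-length 29-bit unsigned integer.

    Returns (value, number_of_bytes_consumed).
    """
    value = 0
    for i in range(4):
        byte = data[i]
        if i < 3:
            value = (value << 7) | (byte & 0x7F)
            if not byte & 0x80:  # high bit clear: this was the last byte
                return value, i + 1
        else:
            return (value << 8) | byte, 4  # fourth byte keeps all 8 bits

# 0x3FFF encodes as two bytes: 0xFF 0x7F
assert read_u29(bytes([0xFF, 0x7F])) == (0x3FFF, 2)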

And secondly: RTMP is highly session oriented, which made it virtually impossible to multicast an incoming stream. Ideally, if multiple users wanted to watch the same live stream, we could just pass them back pointers to a single session in which that stream is being aired (this would be multicast video streaming). But with RTMP, we had to create an entirely new instance of the stream for every user that wanted access. This was a complete waste.

Three users demonstrating the difference between a multicast video streaming solution and an FMS streaming problem.

 

Read the full article in Toptal Engineering blog 



Django, Flask, and Redis Tutorial: Web Application Session Management Between Python Frameworks

Django Versus Flask: When Django is the Wrong Choice

I love and use Django in lots of my personal and client projects, mostly for more classical web applications and those involving relational databases. However, Django is no silver bullet.

By design, Django is very tightly coupled with its ORM, Template Engine System, and Settings object. Plus, it’s not a new project: it carries a lot of baggage to remain backwards compatible.

Some Python developers see this as a major problem. They say that Django isn’t flexible enough and avoid it whenever possible, instead using a Python microframework like Flask.

I don’t share that opinion. Django is great when used in the appropriate place and time, even if it doesn’t fit into every project spec. As the mantra goes: “Use the right tool for the job”.

(Even when it is not the right place and time, sometimes programming with Django can have unique benefits.)

In some cases, it can indeed be nice to use a more lightweight framework (like Flask). Often, these microframeworks start to shine when you realize how easy they are to hack on.

Microframeworks to the Rescue

In a few of my client projects, we’ve discussed giving up on Django and moving to a microframework, typically when the clients want to do some interesting stuff (in one case, for example, embedding ZeroMQ in the application object) and the project goals seem more difficult to achieve with Django.

More generally, I find Flask useful for:

  • Simple REST API backends
  • Applications that don’t require database access
  • NoSQL-based web apps
  • Web apps with very specific requirements, like custom URL configurations

At the same time, our app required user registration and other common tasks that Django solved years ago. Given its light weight, Flask doesn’t come with the same toolkit.

The question emerged: is Django an all-or-nothing deal? Should we drop it completely from the project, or can we learn to combine it with the flexibility of other microframeworks or traditional frameworks? Can we pick and choose the pieces we want to use and eschew others?

Can we have the best of both worlds? I say yes, especially when it comes to session management.

(Not to mention, there are a lot of projects out there for Django freelancers.)

Now the Python Tutorial: Sharing Django Sessions

The goal of this post is to delegate the tasks of user authentication and registration to Django, yet use Redis to share user sessions with other frameworks. I can think of a few scenarios in which something like this would be useful:

  • You need to develop a REST API separately from your Django app but want to share session data.
  • You have a specific component that may need to be replaced later on or scaled out for some reason and still need session data.

For this tutorial, I’ll use Redis to share sessions between two frameworks (in this case, Django and Flask). In the current setup, I’ll use SQLite to store user information, but you can have your back-end tied to a NoSQL database (or a SQL-based alternative) if need be.

Understanding Sessions

To share sessions between Django and Flask, we need to know a bit about how Django stores its session information. The Django docs are pretty good, but I’ll provide some background for completeness.

Session Management Varieties

Generally, you can choose to manage your Python app’s session data in one of two ways:

  • Cookie-based sessions: In this scenario, the session data is not stored in a data store on the back-end. Instead, it’s serialized, signed (with a SECRET_KEY), and sent to the client. When the client sends that data back, its integrity is checked for tampering and it is deserialized again on the server.

  • Storage-based sessions: In this scenario, the session data itself is not sent to the client. Instead, only a small key is sent, identifying the current user’s session, which is stored in the session store on the back-end.

In our example, we’re more interested in the latter scenario: we want our session data to be stored on the back-end and then checked in Flask. The same thing could be done in the former, but as the Django documentation mentions, there are some concerns about the security of the first method.
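
For orientation, Django selects its session back-end via the SESSION_ENGINE setting. The following two lines are an assumed configuration, with the module path taken from the redis_sessions_fork REPL session shown later in this post:

# settings.py - storage-based sessions kept in Redis
SESSION_ENGINE = 'redis_sessions_fork.session'  # module providing SessionStore
SESSION_COOKIE_NAME = 'sessionid'               # Django's default cookie name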

The General Workflow

The general workflow of session handling and management will be similar to this diagram:

A diagram showing the management of user sessions between Flask and Django using Redis.

Let’s walk through session sharing in a little more detail:

  1. When a new request comes in, the first step is to send it through the registered middleware in the Django stack. We’re interested here in the SessionMiddleware class which, as you might expect, is related to session management and handling:

    class SessionMiddleware(object):
        def process_request(self, request):
            engine = import_module(settings.SESSION_ENGINE)
            session_key = request.COOKIES.get(settings.SESSION_COOKIE_NAME, None)
            request.session = engine.SessionStore(session_key)


    In this snippet, Django grabs the registered SessionEngine (we’ll get to that soon), extracts the session key from the SESSION_COOKIE_NAME cookie (sessionid, by default), and creates a new instance of the selected SessionEngine to handle session storage.

  2. Later on (after the user view is processed, but still in the middleware stack), the session engine calls its save method to save any changes to the data store. (During view handling, the user may have changed a few things within the session, e.g., by adding a new value to the session object with request.session.) Then, the SESSION_COOKIE_NAME is sent to the client. Here’s the simplified version:

    def process_response(self, request, response):
        ...
        if response.status_code != 500:
            request.session.save()
            response.set_cookie(settings.SESSION_COOKIE_NAME,
                    request.session.session_key, max_age=max_age,
                    expires=expires, domain=settings.SESSION_COOKIE_DOMAIN,
                    path=settings.SESSION_COOKIE_PATH,
                    secure=settings.SESSION_COOKIE_SECURE or None,
                    httponly=settings.SESSION_COOKIE_HTTPONLY or None)
        return response


We’re particularly interested in the SessionEngine class, which we’ll replace with something to store and load data to and from a Redis back-end.

Fortunately, there are a few projects that already handle this for us. Here’s an example from redis_sessions_fork. Pay close attention to the save and load methods, which (respectively) store and load the session into and from Redis:

class SessionStore(SessionBase):
    """
    Redis session back-end for Django
    """
    def __init__(self, session_key=None):
        super(SessionStore, self).__init__(session_key)

    def _get_or_create_session_key(self):
        if self._session_key is None:
            self._session_key = self._get_new_session_key()
        return self._session_key

    def load(self):
        session_data = backend.get(self.session_key)
        if not session_data is None:
            return self.decode(session_data)
        else:
            self.create()
            return {}

    def exists(self, session_key):
        return backend.exists(session_key)

    def create(self):
        while True:
            self._session_key = self._get_new_session_key()
            try:
                self.save(must_create=True)
            except CreateError:
                continue
            self.modified = True
            self._session_cache = {}
            return

    def save(self, must_create=False):
        session_key = self._get_or_create_session_key()
        expire_in = self.get_expiry_age()
        session_data = self.encode(self._get_session(no_load=must_create))
        backend.save(session_key, expire_in, session_data, must_create)

    def delete(self, session_key=None):
        if session_key is None:
            if self.session_key is None:
                return
            session_key = self.session_key
        backend.delete(session_key)

It’s important to understand how this class is operating as we’ll need to implement something similar on Flask to load session data. Let’s take a closer look with a REPL example:

>>> from django.conf import settings
>>> from django.utils.importlib import import_module
>>> engine = import_module(settings.SESSION_ENGINE)
>>> store = engine.SessionStore()
>>> store
<redis_sessions_fork.session.SessionStore object at 0x3761cd0>
>>> store["count"] = 1
>>> store.save()
>>> store.load()
{u'count': 1}

The session store’s interface is pretty easy to understand, but there’s a lot going on under the hood. We should dig a little deeper so that we can implement something similar on Flask.

Note: You might ask, “Why not just copy the SessionEngine into Flask?” Easier said than done. As we discussed in the beginning, Django is tightly coupled with its Settings object, so you can’t just import some Django module and use it without any additional work.

Django Session (De-)Serialization

As I said, Django does a lot of work to mask the complexity of its session storage. Let’s check the Redis key that’s stored in the above snippets:

>>> store.session_key
u"ery3j462ezmmgebbpwjajlxjxmvt5adu"

Now, let’s query that key on the redis-cli:

redis 127.0.0.1:6379> get "django_sessions:ery3j462ezmmgebbpwjajlxjxmvt5adu"
"ZmUxOTY0ZTFkMmNmODA2OWQ5ZjE4MjNhZmQxNDM0MDBiNmQzNzM2Zjp7ImNvdW50IjoxfQ=="

What we see here is a very long, Base64-encoded string. To understand its purpose, we need to look at Django’s SessionBase class to see how it’s handled:

class SessionBase(object):
    """
    Base class for all Session classes.
    """

    def encode(self, session_dict):
        "Returns the given session dictionary serialized and encoded as a string."
        serialized = self.serializer().dumps(session_dict)
        hash = self._hash(serialized)
        return base64.b64encode(hash.encode() + b":" + serialized).decode('ascii')

    def decode(self, session_data):
        encoded_data = base64.b64decode(force_bytes(session_data))
        try:
            hash, serialized = encoded_data.split(b':', 1)
            expected_hash = self._hash(serialized)
            if not constant_time_compare(hash.decode(), expected_hash):
                raise SuspiciousSession("Session data corrupted")
            else:
                return self.serializer().loads(serialized)
        except Exception as e:
            # ValueError, SuspiciousOperation, unpickling exceptions
            if isinstance(e, SuspiciousOperation):
                logger = logging.getLogger('django.security.%s' %
                                           e.__class__.__name__)
                logger.warning(force_text(e))
            return {}

The encode method first serializes the data with the current registered serializer. In other words, it converts the session into a string, which it can later convert back into a session (look at the SESSION_SERIALIZER documentation for more). Then, it hashes the serialized data and uses this hash later on as a signature to check the integrity of the session data. Finally, it returns that data pair to the user as a Base64-encoded string.

By the way: before version 1.6, Django defaulted to using pickle for serialization of session data. Due to security concerns, the default serialization method is now django.contrib.sessions.serializers.JSONSerializer.

Encoding an Example Session

Let’s see the session management process in action. Here, our session dictionary will simply be a count and some integer, but you can imagine how this would generalize to more complicated user sessions.

>>> encoded = store.encode({'count': 1})
>>> encoded
u'ZmUxOTY0ZTFkMmNmODA2OWQ5ZjE4MjNhZmQxNDM0MDBiNmQzNzM2Zjp7ImNvdW50IjoxfQ=='
>>> base64.b64decode(encoded)
'fe1964e1d2cf8069d9f1823afd143400b6d3736f:{"count":1}'

The result of the encode method (u’ZmUxOTY…==’) is an encoded string containing the serialized user session and its hash. When we decode it, we indeed get back both the hash (‘fe1964e…’) and the session ({"count":1}).

Note that the decode method checks to ensure that the hash is correct for that session, guaranteeing integrity of the data when we go to use it in Flask. In our case, we’re not too worried about our session being tampered with on the client side because:

  • We aren’t using cookie-based sessions, i.e., we’re not sending all user data to the client.

  • On Flask, we’ll need a read-only SessionStore, which will tell us whether a given key exists and return the stored data.

Extending to Flask

Next, let’s create a simplified version of the Redis session engine (database) to work with Flask. We’ll use the same SessionStore (defined above) as a base class, but we’ll need to remove some of its functionality, e.g., checking for bad signatures or modifying sessions. We’re more interested in a read-only SessionStore that will load the session data saved from Django. Let’s see how it comes together:

class SessionStore(object):
    # The default serializer, for now; secret is unused in this
    # read-only store, so it may be omitted.
    def __init__(self, conn, session_key, secret=None, serializer=None):
        self._conn = conn
        self.session_key = session_key
        self._secret = secret
        self.serializer = serializer or JSONSerializer

    def load(self):
        session_data = self._conn.get(self.session_key)
        if not session_data is None:
            return self._decode(session_data)
        else:
            return {}

    def exists(self, session_key):
        return self._conn.exists(session_key)

    def _decode(self, session_data):
        """
        Decodes the Django session

        :param session_data:
        :return: decoded data
        """
        encoded_data = base64.b64decode(force_bytes(session_data))
        try:
            # Could produce ValueError if there is no ':'
            hash, serialized = encoded_data.split(b':', 1)
            # The Django version also checks for corrupted data here;
            # I don't find it useful, so I'm removing it
            return self.serializer().loads(serialized)
        except Exception as e:
            # ValueError, SuspiciousOperation, unpickling exceptions. If any of
            # these happen, return an empty dictionary (i.e., empty session).
            return {}

We only need the load method, because it’s a read-only implementation of the storage. That means you can’t log out directly from Flask; instead, you might want to redirect this task to Django. Remember, the goal here is to manage sessions between these two Python frameworks to give you more flexibility.

Flask Sessions

The Flask microframework supports cookie-based sessions, which means all of the session data is sent to the client, Base64-encoded and cryptographically signed. But actually, we’re not very interested in Flask’s session support.

What we need is to get the session ID created by Django and check it against the Redis back-end so that we can be sure the request belongs to a signed-in user. In summary, the ideal process would be (this syncs up with the diagram above):

  • We grab the Django session ID from the user’s cookie.
  • If the session ID is found in Redis, we return the session matching that ID.
  • If not, we redirect them to a login page.

It’ll be handy to have a decorator to check for that information and set the current user_id into the g variable in Flask:

from functools import wraps
from flask import g, request, redirect, url_for

def login_required(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        djsession_id = request.cookies.get("sessionid")
        if djsession_id is None:
            return redirect("/")

        key = get_session_prefixed(djsession_id)
        session_store = SessionStore(redis_conn, key)
        auth = session_store.load()

        if not auth:
            return redirect("/")

        g.user_id = str(auth.get("_auth_user_id"))

        return f(*args, **kwargs)
    return decorated_function

In the example above, we’re still using the SessionStore we defined previously to fetch the Django data from Redis. If the session has an _auth_user_id, we return the content from the view function; otherwise, the user is redirected to a login page, just like we wanted.
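
The decorator leans on two names this excerpt doesn’t define: redis_conn and get_session_prefixed. A plausible minimal version of both, plus a view using the decorator, might look like this (the "django_sessions:" prefix matches the redis-cli query shown earlier):

import redis
from flask import Flask, g

redis_conn = redis.StrictRedis(host='localhost', port=6379, db=0)

def get_session_prefixed(session_id, prefix='django_sessions:'):
    # Namespace the key the same way the Django session engine does.
    return prefix + session_id

app = Flask(__name__)

@app.route('/private')  # reachable at /backend/private once mounted below
@login_required
def private_stuff():
    # g.user_id was set by the decorator from the shared Django session.
    return 'Hello, user %s' % g.user_id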

Gluing Things Together

In order to share cookies, I find it convenient to start Django and Flask via a WSGI server and glue them together. In this example, I’ve used CherryPy:

from cherrypy import wsgiserver  # CherryPy's bundled pure-Python WSGI server

from app import app  # the Flask application
from django.core.wsgi import get_wsgi_application

application = get_wsgi_application()  # the Django application

# Dispatch requests by path prefix to the appropriate framework.
d = wsgiserver.WSGIPathInfoDispatcher({
    "/": application,
    "/backend": app
})

server = wsgiserver.CherryPyWSGIServer(("127.0.0.1", 8080), d)
server.start()

With that, Django will be served at the “/” endpoint and Flask at “/backend”.

In Conclusion

Rather than examining Django versus Flask or encouraging you only to learn the Flask microframework, I’ve welded together Django and Flask, getting them to share the same session data for authentication by delegating the task to Django. As Django ships with plenty of modules to solve user registration, login, and logout (just to name a few), combining these two frameworks will save you valuable time while providing you with the opportunity to hack on a manageable microframework like Flask.

This post originally appeared in the Toptal Engineering blog 



The Python Tutorial

Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.

The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python Web site, https://www.python.org/, and may be freely distributed. The same site also contains distributions of and pointers to many free third-party Python modules, programs and tools, and additional documentation.

The Python interpreter is easily extended with new functions and data types implemented in C or C++ (or other languages callable from C). Python is also suitable as an extension language for customizable applications.

This tutorial introduces the reader informally to the basic concepts and features of the Python language and system. It helps to have a Python interpreter handy for hands-on experience, but all examples are self-contained, so the tutorial can be read off-line as well.

For a description of standard objects and modules, see The Python Standard Library. The Python Language Reference gives a more formal definition of the language. To write extensions in C or C++, read Extending and Embedding the Python Interpreter and Python/C API Reference Manual. There are also several books covering Python in depth.

This tutorial does not attempt to be comprehensive and cover every single feature, or even every commonly used feature. Instead, it introduces many of Python’s most noteworthy features, and will give you a good idea of the language’s flavor and style. After reading it, you will be able to read and write Python modules and programs, and you will be ready to learn more about the various Python library modules described in The Python Standard Library.



A Tutorial for Reverse Engineering Your Software's Private API: Hacking Your Couch

Traveling is my passion, and I’m a huge fan of Couchsurfing. Couchsurfing is a global community of travelers, where you can find a place to stay or share your own home with other travelers. On top of that, Couchsurfing helps you enjoy a genuine traveling experience while interacting with locals. I’ve been involved with the Couchsurfing community for over 3 years. I attended meetups at first, and then I was finally able to host people. What an amazing journey it was! I’ve met so many incredible people from all over the world and made lots of friends. This whole experience truly changed my life.

I’ve hosted a lot of travelers myself, many more than I’ve actually surfed yet. While living in one of the major tourist destinations on the French Riviera, I received an enormous amount of couch requests (up to 10 a day during high season). As a freelance back-end developer, I immediately noticed that the problem with the couchsurfing.com website is that it doesn’t really handle such “high-load” cases properly. There is no information about the availability of your couch - when you receive a new couch request, you can’t be sure whether you are already hosting someone at that time. There should be a visual representation of your accepted and pending requests, so you can manage them better. Also, if you could make your couch availability public, you could avoid unnecessary couch requests. To better understand what I have in mind, take a look at the Airbnb calendar.

Lots of companies are notorious for not listening to their users. Knowing the history of Couchsurfing, I couldn’t count on them to implement this feature anytime soon. Ever since the website became a for-profit company, the community deteriorated. To better understand what I’m talking about, I suggest reading these two articles:

I knew that a lot of community members would be happy to have this functionality. So, I decided to make an app to solve this problem. It turns out there is no public Couchsurfing API available. Here is the response I received from their support team:

“Unfortunately we have to inform you that our API is not actually public and there are no plans at the moment to make it public.”

Breaking Into My Couch

It was time to use some of my favorite software reverse engineering techniques to break into Couchsurfing.com. I assumed that their mobile apps must use some sort of API to query the backend. So, I had to intercept the HTTP requests coming from a mobile app to the backend. For that purpose I set up a proxy in the local network, and connected my iPhone to it to intercept HTTP requests. This way, I was able to find access points of their private API and figure out their JSON payload format.
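
The post doesn’t name its tooling; as one hypothetical setup, mitmproxy can play the interception role described here:

# Start an intercepting proxy on port 8080, then point the phone's Wi-Fi
# HTTP proxy setting at this machine's LAN address:
mitmproxy --listen-port 8080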

Finally I created a website which serves the purpose of helping people manage their couch requests, and show surfers a couch availability calendar. I published a link to it on the community forums (which are also quite segmented in my opinion, and it’s difficult to find information there). The reception was mostly positive, although some people didn’t like the idea that the website required couchsurfing.com credentials, which was a matter of trust really.

The website worked like this: you log in to the website with your couchsurfing.com credentials, and after a few clicks you get the HTML code which you can embed into your couchsurfing.com profile, and voila - you have an automatically updated calendar in your profile. Below is a screenshot of the calendar, and here are the articles on how I made it:

I created a great feature for Couchsurfing, and I naturally assumed that they would appreciate my work - perhaps even offer me a position on their development team. I sent an email to jobs(at)couchsurfing.com with a link to the website, my resume, and a reference: a thank-you note left by one of my Couchsurfing guests.

A few days later, they followed up on my reverse engineering efforts. In the reply, it was clear that the only thing they were concerned about was their own security, so they asked me to take down the blog posts I’d written about the API, and eventually the website. I took down the posts immediately, as my intention was not to violate the terms of use and fish for user credentials, but rather to help the Couchsurfing community. I had the impression that I was treated as a criminal, and the company focused solely on the fact that my website required user credentials.

Read the full article on the Toptal Engineering blog.



Python Design Patterns: For Sleek And Fashionable Code

Python is a dynamic and flexible language. Python design patterns are a great way of harnessing its vast potential. Python’s philosophy is built on top of the idea of well thought out best practices. Python is a dynamic language (did I already say that?) and as such, already implements, or makes it easy to implement, a number of popular design patterns with a few lines of code. Some design patterns are built into Python, so we use them even without knowing. Other patterns are not needed due to the nature of the language.

For example, Factory is a creational Python design pattern aimed at creating new objects, hiding the instantiation logic from the user. But creation of objects in Python is dynamic by design, so additions like Factory are not necessary. Of course, you are free to implement it if you want to. There might be cases where it would be really useful, but they’re an exception, not the norm.

What is so good about Python’s philosophy? Let’s start with this (explore it in the Python terminal):

>>> import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

These might not be patterns in the traditional sense, but these are rules that define the “Pythonic” approach to programming in the most elegant and useful fashion.

We also have the PEP-8 code style guidelines that help structure our code. Following them is a must for me, with some appropriate exceptions, of course. By the way, these exceptions are encouraged by PEP-8 itself:

“But most importantly: know when to be inconsistent – sometimes the style guide just doesn’t apply. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don’t hesitate to ask!”

Combine PEP-8 with The Zen of Python (also a PEP - PEP-20), and you’ll have a perfect foundation to create readable and maintainable code. Add Design Patterns and you are ready to create every kind of software system with consistency and evolvability.

Python Design Patterns

What Is A Design Pattern?

Everything starts with the Gang of Four (GOF). Do a quick online search if you are not familiar with the GOF.

Design patterns are a common way of solving well-known problems. Two main principles are at the base of the design patterns defined by the GOF:

  • Program to an interface, not an implementation.
  • Favor object composition over inheritance.

Let’s take a closer look at these two principles from the perspective of Python programmers.

Program to an interface not an implementation

Think about Duck Typing. In Python, we don’t like to define interfaces and program classes according to these interfaces, do we? But, listen to me! This doesn’t mean we don’t think about interfaces - in fact, with Duck Typing, we do that all the time.

Let’s say some words about the infamous Duck Typing approach to see how it fits in this paradigm: program to an interface.

If it looks like a duck and quacks like a duck, it's a duck!

We don’t bother with the nature of the object, we don’t have to care what the object is; we just want to know if it’s able to do what we need (we are only interested in the interface of the object).

Can the object quack? So, let it quack!

try:
    bird.quack()
except AttributeError:
    self.lol()

Did we define an interface for our duck? No! Did we program to the interface instead of the implementation? Yes! And, I find this so nice.

As Alex Martelli points out in his well known presentation about Design Patterns in Python, “Teaching the ducks to type takes a while, but saves you a lot of work afterwards!”

Favor object composition over inheritance

Now, that’s what I call a Pythonic principle! I have created fewer classes/subclasses than I have wrapped one class (or, more often, several classes) in another class.

Instead of doing this:

class User(DbObject):
    pass

We can do something like this:

class User:
    _persist_methods = ['get', 'save', 'delete']

    def __init__(self, persister):
        self._persister = persister

    def __getattr__(self, attribute):
        if attribute in self._persist_methods:
            return getattr(self._persister, attribute)
        # Without this, unknown attributes would silently return None.
        raise AttributeError(attribute)

The advantages are obvious. We can restrict which methods of the wrapped class are exposed. We can inject the persister instance at runtime! For example, today it’s a relational database, but tomorrow it could be whatever else, with the interface we need (again those pesky ducks).

Composition is elegant and natural to Python.

Behavioral Patterns

Behavioural patterns involve communication between objects - how objects interact and fulfil a given task. According to GOF principles, there are a total of 11 behavioural patterns: Chain of Responsibility, Command, Interpreter, Iterator, Mediator, Memento, Observer, State, Strategy, Template Method, Visitor.

Behavioural patterns deal with inter-object communication, controlling how various objects interact and perform different tasks.

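As a taste of the category (this example is ours, not the article’s), here is a minimal Observer: subscribers register callbacks with a publisher, which notifies them all when an event occurs.

class Publisher:
    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        # Notify every registered observer of the event.
        for callback in self._subscribers:
            callback(message)

feed = Publisher()
feed.subscribe(lambda msg: print('logger saw:', msg))
feed.subscribe(lambda msg: print('mailer saw:', msg))
feed.publish('new article posted')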

Read the whole article here by Andrei Boyanov, Toptal Freelance Developer



WSGI: The Server-Application Interface for Python

In 1993, the web was still in its infancy, with about 14 million users and a hundred websites. Pages were static but there was already a need to produce dynamic content, such as up-to-date news and data. Responding to this, Rob McCool and other contributors implemented the Common Gateway Interface (CGI) in the National Center for Supercomputing Applications (NCSA) HTTPd web server (the forerunner of Apache). This was the first web server that could serve content generated by a separate application.

Since then, the number of users on the Internet has exploded, and dynamic websites have become ubiquitous. When first learning a new language, or even first learning to code, developers soon enough want to know how to hook their code into the web.
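
WSGI, CGI’s modern successor in Python, reduces that hook to a single callable. A minimal application under PEP 3333 (this sketch is ours, not the article’s) looks like:

def application(environ, start_response):
    # environ carries the request; start_response sets status and headers.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello, WSGI!']

if __name__ == '__main__':
    # Serve it with the reference server from the standard library.
    from wsgiref.simple_server import make_server
    make_server('127.0.0.1', 8000, application).serve_forever()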

Source: Toptal 



8 Essential Python Interview Questions

Source: Toptal 

 

What will be the output of the code below? Explain your answer.

def extendList(val, list=[]):
    list.append(val)
    return list

list1 = extendList(10)
list2 = extendList(123,[])
list3 = extendList('a')

print "list1 = %s" % list1
print "list2 = %s" % list2
print "list3 = %s" % list3

How would you modify the definition of extendList to produce the presumably desired behavior?

What will be the output of the code below? Explain your answer.

def multipliers():
    return [lambda x : i * x for i in range(4)]

print [m(2) for m in multipliers()]

How would you modify the definition of multipliers to produce the presumably desired behavior?

What will be the output of the code below? Explain your answer.

class Parent(object):
    x = 1

class Child1(Parent):
    pass

class Child2(Parent):
    pass

print Parent.x, Child1.x, Child2.x
Child1.x = 2
print Parent.x, Child1.x, Child2.x
Parent.x = 3
print Parent.x, Child1.x, Child2.x

This post originally appeared on the Toptal Engineering blog 

 

What will be the output of the code below in Python 2? Explain your answer.

def div1(x,y):
    print "%s/%s = %s" % (x, y, x/y)

def div2(x,y):
    print "%s//%s = %s" % (x, y, x//y)

div1(5,2)
div1(5.,2)
div2(5,2)
div2(5.,2.)

Also, how would the answer differ in Python 3 (assuming, of course, that the above print statements were converted to Python 3 syntax)?

What will be the output of the code below?

list = ['a', 'b', 'c', 'd', 'e']
print list[10:]

Consider the following code snippet:

1. list = [ [ ] ] * 5
2. list  # output?
3. list[0].append(10)
4. list  # output?
5. list[1].append(20)
6. list  # output?
7. list.append(30)
8. list  # output?

What will be the output of lines 2, 4, 6, and 8? Explain your answer.

Given a list of N numbers, use a single list comprehension to produce a new list that only contains those values that are:
(a) even numbers, and
(b) from elements in the original list that had even indices

For example, if list[2] contains a value that is even, that value should be included in the new list, since it is also at an even index (i.e., 2) in the original list. However, if list[3] contains an even number, that number should not be included in the new list since it is at an odd index (i.e., 3) in the original list.

Given the following subclass of dictionary:

class DefaultDict(dict):
    def __missing__(self, key):
        return []

Will the code below work? Why or why not?

d = DefaultDict()
d['florp'] = 127
 
 


Python Multithreading Tutorial: Concurrency and Parallelism

Discussions criticizing Python often talk about how it is difficult to use Python for multithreaded work, pointing fingers at what is known as the global interpreter lock (affectionately referred to as the “GIL”) that prevents multiple threads of Python code from running simultaneously. Because of this, the threading module doesn’t quite behave the way you would expect it to if you’re coming to Python from other languages such as C++ or Java. It must be made clear that one can still write code in Python that runs concurrently or in parallel and makes a stark difference in the resulting performance, as long as certain things are taken into consideration. If you haven’t read it yet, I suggest you take a look at Eqbal Quran’s article on concurrency and parallelism in Ruby here on the Toptal blog.

In this Python concurrency tutorial, we will write a small Python script to download the most popular images from Imgur. We will start with a version that downloads images sequentially, or one at a time. As a prerequisite, you will have to register an application on Imgur. If you do not have an Imgur account already, please create one first.

The scripts in this tutorial have been tested with Python 3.4.2. With some changes, they should also run with Python 2 - urllib is what has changed the most between these two versions of Python.

Getting Started with Multithreading in Python

Let us start by creating a Python module, named “download.py”. This file will contain all the functions necessary to fetch the list of images and download them. We will split these functionalities into three separate functions:

  • get_links
  • download_link
  • setup_download_dir

The third function, “setup_download_dir”, will be used to create a download destination directory if it doesn’t already exist.

Imgur’s API requires HTTP requests to bear the “Authorization” header with the client ID. You can find this client ID on the dashboard of the application that you have registered on Imgur. The API’s response is JSON encoded, and we can use Python’s standard json library to decode it. Downloading the image is an even simpler task, as all you have to do is fetch the image by its URL and write it to a file.


This is what the script looks like:

import json
import logging
import os
from pathlib import Path
from urllib.request import urlopen, Request

logger = logging.getLogger(__name__)

def get_links(client_id):
    headers = {'Authorization': 'Client-ID {}'.format(client_id)}
    req = Request('https://api.imgur.com/3/gallery/', headers=headers, method='GET')
    with urlopen(req) as resp:
        # read() works across Python 3 versions (readall() does not)
        data = json.loads(resp.read().decode('utf-8'))
    return map(lambda item: item['link'], data['data'])

def download_link(directory, link):
    logger.info('Downloading %s', link)
    download_path = directory / os.path.basename(link)
    with urlopen(link) as image, download_path.open('wb') as f:
        f.write(image.read())

def setup_download_dir():
    download_dir = Path('images')
    if not download_dir.exists():
        download_dir.mkdir()
    return download_dir

Next, we will need to write a module that will use these functions to download the images, one by one. We will name this “single.py”. This will contain the main function of our first, naive version of the Imgur image downloader. The module will retrieve the Imgur client ID from the environment variable “IMGUR_CLIENT_ID”. It will invoke “setup_download_dir” to create the download destination directory. Finally, it will fetch a list of images using the get_links function, filter out all GIF and album URLs, and then use “download_link” to download and save each of those images to disk. Here is what “single.py” looks like:

import logging
import os
from time import time

from download import setup_download_dir, get_links, download_link

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logging.getLogger('requests').setLevel(logging.CRITICAL)
logger = logging.getLogger(__name__)

def main():
    ts = time()
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = [l for l in get_links(client_id) if l.endswith('.jpg')]
    for link in links:
        download_link(download_dir, link)
    print('Took {}s'.format(time() - ts))

if __name__ == '__main__':
    main()

On my laptop, this script took 19.4 seconds to download 91 images. Please do note that these numbers may vary based on the network you are on. 19.4 seconds isn’t terribly long, but what if we wanted to download more pictures? Perhaps 900 images, instead of 90. With an average of 0.2 seconds per picture, 900 images would take approximately 3 minutes. For 9000 pictures it would take 30 minutes. The good news is that by introducing concurrency or parallelism, we can speed this up dramatically.

All subsequent code examples will only show import statements that are new and specific to that example. For convenience, all of these Python scripts can be found in this GitHub repository.

Using Threads for Concurrency and Parallelism

Threading is one of the most well known approaches to attaining Python concurrency and parallelism. Threading is a feature usually provided by the operating system. Threads are lighter than processes, and share the same memory space.


In our Python thread tutorial, we will write a new module to replace “single.py”. This module will create a pool of 8 threads, making a total of 9 threads including the main thread. I chose 8 worker threads because my computer has 8 CPU cores, and one worker thread per core seemed a good number for how many threads to run at once. In practice, this number is chosen much more carefully, based on other factors such as the other applications and services running on the same machine.

This is almost the same as the previous one, except that we now have a new class, DownloadWorker, which is a descendant of the Thread class. The run method has been overridden to run an infinite loop. On every iteration, it calls “self.queue.get()” to try to fetch a URL from a thread-safe queue. It blocks until there is an item in the queue for the worker to process. Once the worker receives an item from the queue, it then calls the same “download_link” method that was used in the previous script to download the image to the images directory. After the download is finished, the worker signals the queue that the task is done. This is very important, because the Queue keeps track of how many tasks were enqueued. The call to “queue.join()” would block the main thread forever if the workers did not signal that they completed a task.

from queue import Queue
from threading import Thread

class DownloadWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            directory, link = self.queue.get()
            download_link(directory, link)
            self.queue.task_done()

def main():
    ts = time()
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = [l for l in get_links(client_id) if l.endswith('.jpg')]
    # Create a queue to communicate with the worker threads
    queue = Queue()
    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        # Setting daemon to True will let the main thread exit even though the workers are blocking
        worker.daemon = True
        worker.start()
    # Put the tasks into the queue as a tuple
    for link in links:
        logger.info('Queueing {}'.format(link))
        queue.put((download_dir, link))
    # Causes the main thread to wait for the queue to finish processing all the tasks
    queue.join()
    print('Took {}'.format(time() - ts))

Running this script on the same machine used earlier results in a download time of 4.1 seconds! That’s 4.7 times faster than the previous example. While this is much faster, it is worth mentioning that only one thread was executing at a time throughout this process, due to the GIL. Therefore, this code is concurrent but not parallel. The reason it is still faster is that this is an IO-bound task. The processor is hardly breaking a sweat while downloading these images; the majority of the time is spent waiting for the network. This is why threading can provide a large speed increase: the processor can switch between the threads whenever one of them is ready to do some work. However, using the threading module in Python or any other interpreted language with a GIL can actually result in reduced performance. If your code is performing a CPU-bound task, such as decompressing gzip files, using the threading module will result in a slower execution time. For CPU-bound tasks and truly parallel execution, we can use the multiprocessing module.
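To see the GIL’s effect on CPU-bound work for yourself, consider this small sketch (my illustration, not part of the original tutorial; the timing will vary by machine):

import time
from threading import Thread

def countdown(n):
    # A pure-Python busy loop: CPU bound, so the GIL serializes it
    while n > 0:
        n -= 1

ts = time.time()
threads = [Thread(target=countdown, args=(10000000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('Two threads took {:.2f}s'.format(time.time() - ts))
# On CPython this is typically no faster (and often slower) than one
# thread counting down 20,000,000, because only one thread runs at a time.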

While the de facto reference Python implementation - CPython - has a GIL, this is not true of all Python implementations. For example, IronPython, a Python implementation using the .NET framework, does not have a GIL, and neither does Jython, the Java-based implementation. You can find a list of working Python implementations here.

Spawning Multiple Processes

The multiprocessing module is easier to drop in than the threading module, as we don’t need to add a class as we did in the threading example. The only changes we need to make are in the main function.


To use multiple processes we create a multiprocessing Pool. With the map method it provides, we will pass the list of URLs to the pool, which in turn will spawn 8 new processes and use each one to download the images in parallel. This is true parallelism, but it comes with a cost. The entire memory of the script is copied into each subprocess that is spawned. In this simple example it isn’t a big deal, but it can easily become serious overhead for non-trivial programs.

from functools import partial
from multiprocessing.pool import Pool

def main():
    ts = time()
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = [l for l in get_links(client_id) if l.endswith('.jpg')]
    download = partial(download_link, download_dir)
    with Pool(8) as p:
        p.map(download, links)
    print('Took {}s'.format(time() - ts))
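One caveat worth adding (it is not shown in the original listing): on platforms where multiprocessing uses the “spawn” start method, such as Windows, each child process re-imports the main module, so the entry point must be protected with the usual guard:

if __name__ == '__main__':
    main()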


Distributing to Multiple Workers

While the threading and multiprocessing modules are great for scripts that are running on your personal computer, what should you do if you want the work to be done on a different machine, or you need to scale up to more than the CPU on one machine can handle? A great use case for this is long-running back-end tasks for web applications. If you have some long running tasks, you don’t want to spin up a bunch of subprocesses or threads on the same machine that need to be running the rest of your application code. This will degrade the performance of your application for all of your users. What would be great is to be able to run these jobs on another machine, or many other machines.

A great Python library for this task is RQ, a very simple yet powerful library. You first enqueue a function and its arguments using the library. This pickles the function call representation, which is then appended to a Redis list. Enqueueing the job is the first step, but will not do anything yet. We also need at least one worker to listen on that job queue.


The first step is to install and run a Redis server on your computer, or have access to a running Redis server. After that, there are only a few small changes made to the existing code. We first create an instance of an RQ Queue and pass it an instance of a Redis server from the redis-py library. Then, instead of just calling our “download_link” method, we call “q.enqueue(download_link, download_dir, link)”. The enqueue method takes a function as its first argument, then any other arguments or keyword arguments are passed along to that function when the job is actually executed.

One last step we need to do is to start up some workers. RQ provides a handy script to run workers on the default queue. Just run “rqworker” in a terminal window and it will start a worker listening on the default queue. Please make sure your current working directory is the same one the scripts reside in. If you want to listen to a different queue, you can run “rqworker queue_name” and it will listen to that named queue. The great thing about RQ is that as long as you can connect to Redis, you can run as many workers as you like on as many different machines as you like; therefore, it is very easy to scale up as your application grows. Here is the source for the RQ version:

from redis import Redis
from rq import Queue

def main():
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = [l for l in get_links(client_id) if l.endswith('.jpg')]
    q = Queue(connection=Redis(host='localhost', port=6379))
    for link in links:
        q.enqueue(download_link, download_dir, link)

However, RQ is not the only Python job queue solution. RQ is easy to use and covers simple use cases extremely well, but if more advanced options are required, other job queue solutions (such as Celery) can be used.



  • Avoid the 10 Most Common Mistakes That Python Programmers Make

    About Python

    Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components or services. Python supports modules and packages, thereby encouraging program modularity and code reuse.

    About this article

    Python’s simple, easy-to-learn syntax can mislead Python developers – especially those who are newer to the language – into missing some of its subtleties and underestimating the power of the language.

    With that in mind, this article presents a “top 10” list of somewhat subtle, harder-to-catch mistakes that can bite even some more advanced Python developers in the rear.


    (Note: This article is intended for a more advanced audience than Common Mistakes of Python Programmers, which is geared more toward those who are newer to the language.)

    Common Mistake #1: Misusing expressions as defaults for function arguments

    Python allows you to specify that a function argument is optional by providing a default value for it. While this is a great feature of the language, it can lead to some confusion when the default value is mutable. For example, consider this Python function definition:

    >>> def foo(bar=[]):        # bar is optional and defaults to [] if not specified
    ...    bar.append("baz")    # but this line could be problematic, as we'll see...
    ...    return bar
    
    

    A common mistake is to think that the optional argument will be set to the specified default expression each time the function is called without supplying a value for the optional argument. In the above code, for example, one might expect that calling foo() repeatedly (i.e., without specifying a bar argument) would always return ['baz'], since the assumption would be that each time foo() is called (without a bar argument specified) bar is set to [] (i.e., a new empty list).

    But let’s look at what actually happens when you do this:

    >>> foo()
    ["baz"]
    >>> foo()
    ["baz", "baz"]
    >>> foo()
    ["baz", "baz", "baz"]
    
    

    Huh? Why did it keep appending the default value of "baz" to an existing list each time foo() was called, rather than creating a new list each time?

    The more advanced Python programming answer is that the default value for a function argument is only evaluated once, at the time that the function is defined. Thus, the bar argument is initialized to its default (i.e., an empty list) only when foo() is first defined, but then calls to foo() (i.e., without a bar argument specified) will continue to use the same list to which bar was originally initialized.
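    You can see this persistence directly by inspecting the function’s stored defaults between calls (an illustrative aside on my part, not from the original):

    >>> foo.__defaults__   # the single list object shared by every call
    (['baz', 'baz', 'baz'],)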

    FYI, a common workaround for this is as follows:

    >>> def foo(bar=None):
    ...    if bar is None:		# or if not bar:
    ...        bar = []
    ...    bar.append("baz")
    ...    return bar
    ...
    >>> foo()
    ["baz"]
    >>> foo()
    ["baz"]
    >>> foo()
    ["baz"]
    
    

    Common Mistake #2: Using class variables incorrectly

    Consider the following example:

    >>> class A(object):
    ...     x = 1
    ...
    >>> class B(A):
    ...     pass
    ...
    >>> class C(A):
    ...     pass
    ...
    >>> print A.x, B.x, C.x
    1 1 1
    
    

    Makes sense.

    >>> B.x = 2
    >>> print A.x, B.x, C.x
    1 2 1
    
    

    Yup, again as expected.

    >>> A.x = 3
    >>> print A.x, B.x, C.x
    3 2 3
    
    

    What the $%#!&?? We only changed A.x. Why did C.x change too?

    In Python, class variables are internally handled as dictionaries and follow what is often referred to as the Method Resolution Order (MRO). So in the above code, since the attribute x is not found in class C, it will be looked up in its base classes (only A in the above example, although Python supports multiple inheritance). In other words, C doesn’t have its own x property, independent of A. Thus, references to C.x are in fact references to A.x. This can cause subtle bugs unless it’s handled properly. Learn more about class attributes in Python.
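    One quick way to confirm this (my illustration, not part of the original) is to check each class’s own attribute dictionary:

    >>> 'x' in C.__dict__   # C has no x of its own...
    False
    >>> 'x' in A.__dict__   # ...so lookups on C fall through to A
    True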

    Common Mistake #3: Specifying parameters incorrectly for an exception block

    Suppose you have the following code:

    >>> try:
    ...     l = ["a", "b"]
    ...     int(l[2])
    ... except ValueError, IndexError:  # To catch both exceptions, right?
    ...     pass
    ...
    Traceback (most recent call last):
    File "<stdin>", line 3, in <module>
    IndexError: list index out of range
    
    

    The problem here is that the except statement does not take a list of exceptions specified in this manner. Rather, in Python 2.x, the syntax except Exception, e is used to bind the exception to the optional second parameter specified (in this case e), in order to make it available for further inspection. As a result, in the above code, the IndexError exception is not being caught by the except statement; rather, the exception ends up being bound to a parameter named IndexError.

    The proper way to catch multiple exceptions in an except statement is to specify the first parameter as a tuple containing all exceptions to be caught. Also, for maximum portability, use the as keyword, since that syntax is supported by both Python 2 and Python 3:

    >>> try:
    ...     l = ["a", "b"]
    ...     int(l[2])
    ... except (ValueError, IndexError) as e:  
    ...     pass
    ...
    >>>
    
    

    Common Mistake #4: Misunderstanding Python scope rules

    Python scope resolution is based on what is known as the LEGB rule, which is shorthand for Local, Enclosing, Global, Built-in. Seems straightforward enough, right? Well, actually, there are some subtleties to the way this works in Python, which brings us to the common more advanced Python programming problem below. Consider the following:

    >>> x = 10
    >>> def foo():
    ...     x += 1
    ...     print x
    ...
    >>> foo()
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "<stdin>", line 2, in foo
    UnboundLocalError: local variable 'x' referenced before assignment
    
    

    What’s the problem?

    The above error occurs because, when you make an assignment to a variable in a scope, that variable is automatically considered by Python to be local to that scope and shadows any similarly named variable in any outer scope.

    Many are thereby surprised to get an UnboundLocalError in previously working code when it is modified by adding an assignment statement somewhere in the body of a function. (You can read more about this here.)

    It is particularly common for this to trip up developers when using lists. Consider the following example:

    >>> lst = [1, 2, 3]
    >>> def foo1():
    ...     lst.append(5)   # This works ok...
    ...
    >>> foo1()
    >>> lst
    [1, 2, 3, 5]
    >>> lst = [1, 2, 3]
    >>> def foo2():
    ...     lst += [5]      # ... but this bombs!
    ...
    >>> foo2()
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "<stdin>", line 2, in foo
    UnboundLocalError: local variable 'lst' referenced before assignment
    
    

    Huh? Why did foo2 bomb while foo1 ran fine?

    The answer is the same as in the prior example problem, but is admittedly more subtle.  foo1 is not making an assignment to lst, whereas foo2 is. Remembering that lst += [5] is really just shorthand for lst = lst + [5], we see that we are attempting to assign a value to lst (therefore presumed by Python to be in the local scope). However, the value we are looking to assign to lst is based on lst itself (again, now presumed to be in the local scope), which has not yet been defined. Boom.
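    The original stops at the diagnosis, but if you genuinely want foo2 to rebind the module-level list, one fix (a sketch using the global keyword) is:

    >>> lst = [1, 2, 3]
    >>> def foo2():
    ...     global lst      # lst now refers to the module-level name
    ...     lst += [5]
    ...
    >>> foo2()
    >>> lst
    [1, 2, 3, 5]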

    Common Mistake #5: Modifying a list while iterating over it

    The problem with the following code should be fairly obvious:

    >>> odd = lambda x : bool(x % 2)
    >>> numbers = [n for n in range(10)]
    >>> for i in range(len(numbers)):
    ...     if odd(numbers[i]):
    ...         del numbers[i]  # BAD: Deleting item from a list while iterating over it
    ...
    Traceback (most recent call last):
    File "<stdin>", line 2, in <module>
    IndexError: list index out of range
    
    

    Deleting an item from a list or array while iterating over it is a Python problem that is well known to any experienced software developer. But while the example above may be fairly obvious, even advanced developers can be unintentionally bitten by this in code that is much more complex.

    Fortunately, Python incorporates a number of elegant programming paradigms which, when used properly, can result in significantly simplified and streamlined code. A side benefit of this is that simpler code is less likely to be bitten by the accidental-deletion-of-a-list-item-while-iterating-over-it bug. One such paradigm is that of list comprehensions. Moreover, list comprehensions are particularly useful for avoiding this specific problem, as shown by this alternate implementation of the above code which works perfectly:

    >>> odd = lambda x : bool(x % 2)
    >>> numbers = [n for n in range(10)]
    >>> numbers[:] = [n for n in numbers if not odd(n)]  # ahh, the beauty of it all
    >>> numbers
    [0, 2, 4, 6, 8]
    
    

    Common Mistake #6: Confusing how Python binds variables in closures

    Considering the following example:

    	>>> def create_multipliers():
    	...     return [lambda x : i * x for i in range(5)]
    	>>> for multiplier in create_multipliers():
    	...     print multiplier(2)
    	...
    	
    	

    You might expect the following output:

    	0
    	2
    	4
    	6
    	8
    	
    	

    But you actually get:

    	8
    	8
    	8
    	8
    	8
    	
    	

    Surprise!

    This happens due to Python’s late binding behavior which says that the values of variables used in closures are looked up at the time the inner function is called. So in the above code, whenever any of the returned functions are called, the value of i is looked up in the surrounding scope at the time it is called (and by then, the loop has completed, so i has already been assigned its final value of 4).

    The solution to this common Python problem is a bit of a hack:

    	>>> def create_multipliers():
    	...     return [lambda x, i=i : i * x for i in range(5)]
    	...
    	>>> for multiplier in create_multipliers():
    	...     print multiplier(2)
    	...
    	0
    	2
    	4
    	6
    	8
    	
    	

    Voilà! We are taking advantage of default arguments here to generate anonymous functions in order to achieve the desired behavior. Some would call this elegant. Some would call it subtle. Some hate it. But if you’re a Python developer, it’s important to understand in any case.
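    As an aside (my suggestion, not from the original article), functools.partial achieves the same early binding without the default-argument hack:

    >>> from functools import partial
    >>> from operator import mul
    >>> def create_multipliers():
    ...     return [partial(mul, i) for i in range(5)]
    ...
    >>> for multiplier in create_multipliers():
    ...     print multiplier(2)
    ...
    0
    2
    4
    6
    8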

    Common Mistake #7: Creating circular module dependencies

    Let’s say you have two files, a.py and b.py, each of which imports the other, as follows:

    In a.py:

	import b

	def f():
	    return b.x

	print f()
    	
    	

    And in b.py:

	import a

	x = 1

	def g():
	    print a.f()
    	
    	

    First, let’s try importing a.py:

    	>>> import a
    	1
    	
    	

    Worked just fine. Perhaps that surprises you. After all, we do have a circular import here which presumably should be a problem, shouldn’t it?

    The answer is that the mere presence of a circular import is not in and of itself a problem in Python. If a module has already been imported, Python is smart enough not to try to re-import it. However, depending on the point at which each module is attempting to access functions or variables defined in the other, you may indeed run into problems.

    So returning to our example, when we imported a.py, it had no problem importing b.py, since b.py does not require anything from a.py to be defined at the time it is imported. The only reference in b.py to a is the call to a.f(). But that call occurs inside g(), and nothing in a.py or b.py actually invokes g() at import time, so the circular import causes no trouble here.



