WSGI: The Server-Application Interface for Python

In 1993, the web was still in its infancy, with about 14 million users and a hundred websites. Pages were static but there was already a need to produce dynamic content, such as up-to-date news and data. Responding to this, Rob McCool and other contributors implemented the Common Gateway Interface (CGI) in the National Center for Supercomputing Applications (NCSA) HTTPd web server (the forerunner of Apache). This was the first web server that could serve content generated by a separate application.

Since then, the number of users on the Internet has exploded, and dynamic websites have become ubiquitous. When first learning a new language, or even when first learning to code, developers soon want to know how to hook their code into the web.

Python on the Web and the Rise of WSGI

Since the creation of CGI, much has changed. The CGI approach became impractical, as it required the creation of a new process at each request, wasting memory and CPU. Some other low-level approaches emerged, like [FastCGI](http://www.fastcgi.com/) (1996) and mod_python (2000), providing different interfaces between Python web frameworks and the web server. As different approaches proliferated, the developer’s choice of framework ended up restricting the choices of web servers and vice versa.

To address this problem, in 2003 Phillip J. Eby proposed PEP-0333, the Python Web Server Gateway Interface (WSGI). The idea was to provide a high-level, universal interface between Python applications and web servers.

In 2010, PEP-3333 updated the WSGI interface to add Python 3 support. Nowadays, almost all Python frameworks use WSGI as a means, if not the only means, to communicate with their web servers. This is how Django, Flask, and many other popular frameworks do it.

This article intends to provide the reader with a glimpse into how WSGI works, and allow the reader to build a simple WSGI application or server. It is not meant to be exhaustive, though, and developers intending to implement production-ready servers or applications should take a more thorough look into the WSGI specification.

The Python WSGI Interface

WSGI specifies simple rules that the server and application must conform to. Let’s start by reviewing this overall pattern.

The Python WSGI server-application interface.

Application Interface

In Python 3.5, the application interface looks like this:

def application(environ, start_response):
    body = b'Hello world!\n'
    status = '200 OK'
    headers = [('Content-type', 'text/plain')]
    start_response(status, headers)
    return [body]

In Python 2.7, this interface wouldn’t be much different; the only change would be that the body is represented by a str object, instead of a bytes one.

Though we’ve used a function in this case, any callable will do. The rules for the application object here are:

  • Must be a callable with environ and start_response parameters.
  • Must call the start_response callback before sending the body.
  • Must return an iterable with pieces of the document body.

Another example of an object that satisfies these rules and would produce the same effect is:

class Application:
    def __init__(self, environ, start_response):
        self.environ = environ
        self.start_response = start_response

    def __iter__(self):
        body = b'Hello world!\n'
        status = '200 OK'
        headers = [('Content-type', 'text/plain')]
        self.start_response(status, headers)
        yield body

Server Interface

A WSGI server might interface with this application like this:

def write(chunk):
    '''Write data back to client'''
    ...

def send_status(status):
    '''Send HTTP status code'''
    ...

def send_headers(headers):
    '''Send HTTP headers'''
    ...

def start_response(status, headers):
    '''WSGI start_response callable'''
    send_status(status)
    send_headers(headers)
    return write

# Make request to application
response = application(environ, start_response)

try:
    for chunk in response:
        write(chunk)
finally:
    if hasattr(response, 'close'):
        response.close()

As you may have noticed, the start_response callable returned a write callable that the application may use to send data back to the client, but that was not used by our application code example. This write interface is deprecated, and we can ignore it for now. It will be briefly discussed later in the article.

Another of the server’s responsibilities is to call the optional close method on the response iterator, if it exists. As pointed out in Graham Dumpleton’s article here, it is an often-overlooked feature of WSGI. Calling this method, when present, allows the application to release any resources it may still hold.
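To make the point concrete, here is a small, purely illustrative application object (not taken from the specification) whose close method releases a file handle once the server has finished iterating over the response; the file name is just a placeholder.

class FileBackedApplication:
    def __init__(self, environ, start_response):
        self.environ = environ
        self.start_response = start_response
        self.source = open('body.txt', 'rb')  # placeholder resource held by the app

    def __iter__(self):
        self.start_response('200 OK', [('Content-type', 'text/plain')])
        for chunk in self.source:
            yield chunk

    def close(self):
        # Called by a compliant server once it is done with the response.
        self.source.close()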

The Application Callable’s environ Argument

The environ parameter should be a dictionary object. It is used to pass request and server information to the application, much in the same way CGI does. In fact, all CGI environment variables are valid in WSGI and the server should pass all that apply to the application.

While there are many optional keys that can be passed, several are mandatory. Taking as an example the following GET request:

$ curl 'http://localhost:8000/auth?user=obiwan&token=123'

These are the keys that the server must provide, and the values they would take:

Key                 Value                      Comments
REQUEST_METHOD      "GET"
SCRIPT_NAME         ""                         server setup dependent
PATH_INFO           "/auth"
QUERY_STRING        "user=obiwan&token=123"
CONTENT_TYPE        ""
CONTENT_LENGTH      ""
SERVER_NAME         "127.0.0.1"                server setup dependent
SERVER_PORT         "8000"
SERVER_PROTOCOL     "HTTP/1.1"
HTTP_(...)                                     client-supplied HTTP headers
wsgi.version        (1, 0)                     tuple with the WSGI version
wsgi.url_scheme     "http"
wsgi.input          (file-like object)
wsgi.errors         (file-like object)
wsgi.multithread    False                      True if the server is multithreaded
wsgi.multiprocess   False                      True if the server runs multiple processes
wsgi.run_once       False                      True if the server expects this script to run only once (e.g., in a CGI environment)

The exception to this rule is that if the value of one of these keys would be the empty string (like CONTENT_TYPE in the table above), the key can be omitted from the dictionary, and it is assumed to correspond to the empty string.
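To make the table concrete, here is a small, purely illustrative application that reads a few of these keys; urllib.parse.parse_qs is just one convenient way to split QUERY_STRING (Python 3):

from urllib.parse import parse_qs

def application(environ, start_response):
    method = environ['REQUEST_METHOD']                  # e.g. "GET"
    path = environ.get('PATH_INFO', '')                 # e.g. "/auth"
    query = parse_qs(environ.get('QUERY_STRING', ''))   # e.g. {'user': ['obiwan'], 'token': ['123']}

    user = query.get('user', ['anonymous'])[0]
    body = '{} {} for user {}\n'.format(method, path, user).encode('utf-8')
    start_response('200 OK', [('Content-type', 'text/plain')])
    return [body]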

wsgi.input and wsgi.errors

Most environ keys are straightforward, but two of them deserve a little more clarification: wsgi.input, which must contain a stream with the request body from the client, and wsgi.errors, where the application reports any errors it encounters. Errors sent from the application to wsgi.errors typically would be sent to the server error log.

These two keys must contain file-like objects; that is, objects that provide interfaces to be read or written to as streams, just like the object we get when we open a file or a socket in Python. This may seem tricky at first, but fortunately, Python gives us good tools to handle this.

First, what kind of streams are we talking about? As per WSGI definition, wsgi.input and wsgi.errors must handle bytes objects in Python 3 and str objects in Python 2. In either case, if we’d like to use an in-memory buffer to pass or get data through the WSGI interface, we can use the class io.BytesIO.

As an example, if we are writing a WSGI server, we could provide the request body to the application like this:

  • For Python 2.7
import io
...
request_data = 'some request body'
environ['wsgi.input'] = io.BytesIO(request_data)

  • For Python 3.5
import io
...
request_data = 'some request body'.encode('utf-8') # bytes object
environ['wsgi.input'] = io.BytesIO(request_data)

On the application side, if we wanted to turn a stream input we’ve received into a string, we’d want to write something like this:

  • For Python 2.7
readstr = environ['wsgi.input'].read() # returns str object

  • For Python 3.5
readbytes = environ['wsgi.input'].read() # returns bytes object
readstr = readbytes.decode('utf-8') # convert to str object
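For experimenting with the snippets above, the standard library ships a reference WSGI server, wsgiref; the following minimal sketch (a convenience addition, not part of the original examples) serves the "Hello world!" application locally:

from wsgiref.simple_server import make_server

def application(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain')])
    return [b'Hello world!\n']

if __name__ == '__main__':
    # Serve the application on http://localhost:8000 until interrupted.
    server = make_server('localhost', 8000, application)
    print('Serving on http://localhost:8000 ...')
    server.serve_forever()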

Python Design Patterns: For Sleek And Fashionable Code

Let’s say it again: Python is a high-level programming language with dynamic typing and dynamic binding. I would describe it as a powerful, high-level dynamic language. Many developers are in love with Python because of its clear syntax, well structured modules and packages, and for its enormous flexibility and range of modern features.

In Python, nothing obliges you to write classes and instantiate objects from them. If you don’t need complex structures in your project, you can just write functions. Even better, you can write a flat script for executing some simple and quick task without structuring the code at all.

At the same time Python is a 100 percent object-oriented language. How’s that? Well, simply put, everything in Python is an object. Functions are objects, first class objects (whatever that means). This fact about functions being objects is important, so please remember it.

So, you can write simple scripts in Python, or just open the Python terminal and execute statements right there (that’s so useful!). But at the same time, you can create complex frameworks, applications, libraries and so on. You can do so much in Python. There are of course a number of limitations, but that’s not the topic of this article.

However, because Python is so powerful and flexible, we need some rules (or patterns) when programming in it. So, let's see what patterns are, and how they relate to Python. We will also proceed to implement a few essential Python design patterns.

Why Is Python Good For Patterns?

Any programming language is good for patterns. In fact, patterns should be considered in the context of any given programming language. Both the patterns and the language's syntax and nature impose limitations on our programming. The limitations that come from the language syntax and language nature (dynamic, functional, object oriented, and the like) can differ, as can the reasons behind their existence. The limitations that come from patterns are there for a reason; they are purposeful. That's the basic goal of patterns: to tell us how to do something and how not to do it. We'll speak about patterns, and especially Python design patterns, later.

Python is a dynamic and flexible language. Python design patterns are a great way of harnessing its vast potential.

Python's philosophy is built on top of the idea of well thought out best practices. Python is a dynamic language (did I already say that?) and as such already implements, or makes it easy to implement, a number of popular design patterns with a few lines of code. Some design patterns are built into Python, so we use them even without knowing. Other patterns are not needed due to the nature of the language.

For example, Factory is a creational Python design pattern aimed at creating new objects, hiding the instantiation logic from the user. But creation of objects in Python is dynamic by design, so additions like Factory are not necessary. Of course, you are free to implement it if you want to. There might be cases where it would be really useful, but they're an exception, not the norm.
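As a quick, hypothetical illustration of why: classes in Python are first-class callables, so they can be passed around and invoked exactly where another language would reach for a factory. The exporter classes below are invented for the example.

class JsonExporter:
    def export(self, data):
        print('exporting as JSON:', data)

class CsvExporter:
    def export(self, data):
        print('exporting as CSV:', data)

def make_exporter(exporter_cls):
    # Any callable that returns an exporter will do -- including the class itself.
    return exporter_cls()

exporter = make_exporter(CsvExporter)
exporter.export({'id': 1})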

What is so good about Python’s philosophy? Let’s start with this (explore it in the Python terminal):

>>> import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

These might not be patterns in the traditional sense, but these are rules that define the “Pythonic” approach to programming in the most elegant and useful fashion.

We also have the PEP-8 code guidelines that help structure our code. It's a must for me, with some appropriate exceptions, of course. By the way, these exceptions are encouraged by PEP-8 itself:

“But most importantly: know when to be inconsistent – sometimes the style guide just doesn’t apply. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don’t hesitate to ask!”

Combine PEP-8 with The Zen of Python (also a PEP - PEP-20), and you’ll have a perfect foundation to create readable and maintainable code. Add Design Patterns and you are ready to create every kind of software system with consistency and evolvability.

Python Design Patterns

What Is A Design Pattern?

Everything starts with the Gang of Four (GOF). Do a quick online search if you are not familiar with the GOF.

Design patterns are a common way of solving well-known problems. Two main principles lie at the basis of the design patterns defined by the GOF:

  • Program to an interface not an implementation.
  • Favor object composition over inheritance.

Let’s take a closer look at these two principles from the perspective of Python programmers.

Program to an interface not an implementation

Think about Duck Typing. In Python, we don't like to define interfaces and program classes according to these interfaces, do we? But, listen to me! This doesn't mean we don't think about interfaces; in fact, with Duck Typing we do that all the time.

Let’s say some words about the infamous Duck Typing approach to see how it fits in this paradigm: program to an interface.

If it looks like a duck and quacks like a duck, it's a duck!

We don’t bother with the nature of the object, we don’t have to care what the object is; we just want to know if it’s able to do what we need (we are only interested in the interface of the object).

Can the object quack? So, let it quack!

try:
    bird.quack()
except AttributeError:
    self.lol()

Did we define an interface for our duck? No! Did we program to the interface instead of the implementation? Yes! And, I find this so nice.

As Alex Martelli points out in his well known presentation about Design Patterns in Python, “Teaching the ducks to type takes a while, but saves you a lot of work afterwards!”

Favor object composition over inheritance

Now, that's what I call a Pythonic principle! I have ended up creating far fewer classes and subclasses by wrapping one class (or more often, several classes) in another class.

Instead of doing this:

class User(DbObject):
    pass

We can do something like this:

class User:
    _persist_methods = ['get', 'save', 'delete']

    def __init__(self, persister):
        self._persister = persister

    def __getattr__(self, attribute):
        if attribute in self._persist_methods:
            return getattr(self._persister, attribute)
        # Anything else is genuinely missing.
        raise AttributeError(attribute)

The advantages are obvious. We can restrict which methods of the wrapped class we expose. We can inject the persister instance at runtime! For example, today it's a relational database, but tomorrow it could be whatever, with the interface we need (again those pesky ducks).

Composition is elegant and natural to Python.

Behavioral Patterns

Behavioural patterns involve communication between objects: how objects interact and how they fulfil a given task. According to the GOF, there are a total of 11 behavioral patterns: Chain of Responsibility, Command, Interpreter, Iterator, Mediator, Memento, Observer, State, Strategy, Template Method, and Visitor.

Behavioural patterns deal with inter-object communication, controlling how various objects interact and perform different tasks.

I find these patterns very useful, but this does not mean the other pattern groups are not.

Iterator

Iterators are built into Python. This is one of the most powerful characteristics of the language. Years ago, I read somewhere that iterators make Python awesome, and I think this is still the case. Learn enough about Python iterators and generators and you’ll know everything you need about this particular Python pattern.
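As a quick reminder of what that built-in support looks like, a plain generator function already gives us a full-blown iterator; the countdown name here is just for illustration.

def countdown(n):
    # A generator function: calling it returns an iterator
    # that produces its values lazily, one per loop step.
    while n > 0:
        yield n
        n -= 1

for value in countdown(3):
    print(value)   # prints 3, 2, 1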

Chain of responsibility

This pattern gives us a way to treat a request using different methods, each one addressing a specific part of the request. You know, one of the best principles for good code is the Single Responsibility principle.

Every piece of code must do one, and only one, thing.

This principle is deeply integrated in this design pattern.

For example, if we want to filter some content we can implement different filters, each one doing one precise and clearly defined type of filtering. These filters could be used to filter offensive words, ads, unsuitable video content, and so on.

class ContentFilter(object):
    def __init__(self, filters=None):
        self._filters = list()
        if filters is not None:
            self._filters += filters

    def filter(self, content):
        for filter in self._filters:
            content = filter(content)
        return content

filter = ContentFilter([
    offensive_filter,
    ads_filter,
    porno_video_filter])
filtered_content = filter.filter(content)

Command

This is one of the first Python design patterns I implemented as a programmer. That reminds me: Patterns are not invented, they are discovered. They exist, we just need to find and put them to use. I discovered this one for an amazing project we implemented many years ago: a special purpose WYSIWYM XML editor. After using this pattern intensively in the code, I read more about it on some sites.

The command pattern is handy in situations when, for some reason, we need to start by preparing what will be executed and then to execute it when needed. The advantage is that encapsulating actions in such a way enables Python developers to add additional functionalities related to the executed actions, such as undo/redo, or keeping a history of actions and the like.

Let’s see what a simple and frequently used example looks like:

import os

class RenameFileCommand(object):
    def __init__(self, from_name, to_name):
        self._from = from_name
        self._to = to_name

    def execute(self):
        os.rename(self._from, self._to)

    def undo(self):
        os.rename(self._to, self._from)

class History(object):
    def __init__(self):
        self._commands = list()

    def execute(self, command):
        self._commands.append(command)
        command.execute()

    def undo(self):
        self._commands.pop().undo()

history = History()
history.execute(RenameFileCommand('docs/cv.doc', 'docs/cv-en.doc'))
history.execute(RenameFileCommand('docs/cv1.doc', 'docs/cv-bg.doc'))
history.undo()
history.undo()


Creational Patterns

Let’s start by pointing out that creational patterns are not commonly used in Python. Why? Because of the dynamic nature of the language.

Someone wiser than I once said that Factory is built into Python. It means that the language itself provides us with all the flexibility we need to create objects in a sufficiently elegant fashion; we rarely need to implement anything on top, like Singleton or Factory.

In one Python Design Patterns tutorial, I found a description of the creational design patterns that stated these design “patterns provide a way to create objects while hiding the creation logic, rather than instantiating objects directly using a new operator.”

That pretty much sums up the problem: We don’t have a new operator in Python!

Nevertheless, let’s see how we can implement a few, should we feel we might gain an advantage by using such patterns.

Singleton

The Singleton pattern is used when we want to guarantee that only one instance of a given class exists during runtime. Do we really need this pattern in Python? Based on my experience, it’s easier to simply create one instance intentionally and then use it instead of implementing the Singleton pattern.

But should you want to implement it, here is some good news: In Python, we can alter the instantiation process (along with virtually anything else). Remember the __new__() method I mentioned earlier? Here we go:

class Logger(object):
    def __new__(cls, *args, **kwargs):
        if not hasattr(cls, '_logger'):
            cls._logger = super(Logger, cls).__new__(cls, *args, **kwargs)
        return cls._logger

In this example, Logger is a Singleton.
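A quick sanity check of what that buys us: every call to Logger() now hands back the very same object.

logger_a = Logger()
logger_b = Logger()
assert logger_a is logger_b   # both names point to the single shared instance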

These are the alternatives to using a Singleton in Python:

  • Use a module.
  • Create one instance somewhere at the top-level of your application, perhaps in the config file.
  • Pass the instance to every object that needs it. That's dependency injection, and it's a powerful and easily mastered mechanism.

Dependency Injection

I don't intend to get into a discussion on whether dependency injection is a design pattern, but I will say that it's a very good mechanism for implementing loose coupling, and it helps make our application maintainable and extendable. Combine it with Duck Typing and the Force will be with you. Always.

Duck? Human? Python does not care. It's flexible!

I listed it in the creational pattern section of this post because it deals with the question of when (or even better: where) the object is created. It’s created outside. Better to say that the objects are not created at all where we use them, so the dependency is not created where it is consumed. The consumer code receives the externally created object and uses it. For further reference, please read the most upvoted answer to this Stackoverflow question.

It’s a nice explanation of dependency injection and gives us a good idea of the potential of this particular technique. Basically the answer explains the problem with the following example: Don’t get things to drink from the fridge yourself, state a need instead. Tell your parents that you need something to drink with lunch.

Python offers us all we need to implement that easily. Think about its possible implementation in other languages such as Java and C#, and you’ll quickly realize the beauty of Python.

Let’s think about a simple example of dependency injection:

class Command:
    def __init__(self, authenticate=None, authorize=None):
        self.authenticate = authenticate or self._not_authenticated
        self.authorize = authorize or self._not_authorized

    def execute(self, user, action):
        self.authenticate(user)
        self.authorize(user, action)
        return action()

if in_sudo_mode:
    command = Command(always_authenticated, always_authorized)
else:
    command = Command(config.authenticate, config.authorize)
command.execute(current_user, delete_user_action)

We inject the authenticator and authorizer methods into the Command class. All the Command class needs to do is execute them, without bothering with the implementation details. This way, we may use the Command class with whatever authentication and authorization mechanisms we decide to use at runtime.

We have shown how to inject dependencies through the constructor, but we can easily inject them by setting the object's attributes directly, unlocking even more potential:

command = Command()

if in_sudo_mode:
    command.authenticate = always_authenticated
    command.authorize = always_authorized
else:
    command.authenticate = config.authenticate
    command.authorize = config.authorize
command.execute(current_user, delete_user_action)

There is much more to learn about dependency injection; curious people would search for IoC, for example.

But before you do that, read another Stackoverflow answer, the most upvoted one to this question.

Again, we just demonstrated how implementing this wonderful design pattern in Python is just a matter of using the built-in functionalities of the language.

Let’s not forget what all this means: The dependency injection technique allows for very flexible and easy unit-testing. Imagine an architecture where you can change data storing on-the-fly. Mocking a database becomes a trivial task, doesn’t it? For further information, you can check out Toptal’s Introduction to Mocking in Python.

You may also want to research the Prototype, Builder, and Factory design patterns.

Structural Patterns

Facade

This may very well be the most famous Python design pattern.
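In short, a facade puts one simple entry point in front of several more involved subsystems. The classes below are a minimal, purely illustrative sketch of the idea rather than a canonical implementation.

class Inventory:
    def reserve(self, item):
        print('reserving', item)

class Payment:
    def charge(self, amount):
        print('charging', amount)

class Shipping:
    def schedule(self, item):
        print('scheduling delivery of', item)

class OrderFacade:
    """The facade: one call hides three collaborating subsystems."""
    def __init__(self):
        self._inventory = Inventory()
        self._payment = Payment()
        self._shipping = Shipping()

    def place_order(self, item, amount):
        self._inventory.reserve(item)
        self._payment.charge(amount)
        self._shipping.schedule(item)

OrderFacade().place_order('book', 20)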


Migrate Legacy Data Without Screwing It Up

Migrating legacy data is hard.

Many organizations have old and complex on-premise business CRM systems. Today, there are plenty of cloud SaaS alternatives, which come with many benefits: you pay as you go, and you pay only for what you use. Therefore, many organizations decide to move to these new systems.

Nobody wants to leave valuable data about customers in the old system and start with the empty new system, so we need to migrate this data. Unfortunately, data migration is not an easy task, as around 50 percent of deployment effort is consumed by data migration activities. According to Gartner, Salesforce is the leader of cloud CRM solutions. Therefore, data migration is a major topic for Salesforce deployment.

10 Tips For Successful Legacy Data Migration To Salesforce

How to ensure successful transition of legacy data into a new system
while preserving all history.

So, how can we ensure a successful transition of legacy data into a shiny new system and ensure we will preserve all of its history? In this article, I provide 10 tips for successful data migration. The first five tips apply to any data migration, regardless of the technology used.

Data Migration in General

1. Make Migration a Separate Project

In the software deployment checklist, data migration is not an “export and import” item handled by a clever “push one button” data migration tool that has predefined mapping for target systems.

Data migration is a complex activity, deserving a separate project, plan, approach, budget, and team. An entity-level scope and plan must be created at the project's beginning, ensuring no surprises, such as "Oh, we forgot to load those clients' visit reports, who will do that?" two weeks before the deadline.

The data migration approach defines whether we will load the data in one go (also known as the big bang), or whether we will load small batches every week.

This is not an easy decision, though. The approach must be agreed upon and communicated to all business and technical stakeholders so that everybody is aware of when and what data will appear in the new system. This applies to system outages too.

2. Estimate Realistically

Do not underestimate the complexity of the data migration. Many time-consuming tasks accompany this process, which may be invisible at the project’s beginning.

For example, loading specific data sets for training purposes with a bunch of realistic data, but with sensitive items obfuscated, so that training activities do not generate email notifications to clients.

The basic factor for estimation is the number of fields to be transferred from a source system to a target system.

Some amount of time is needed in different stages of the project for every field, including understanding the field, mapping the source field to the target field, configuring or building transformations, performing tests, measuring data quality for the field, and so on.

Using clever tools, such as Jitterbit, Informatica Cloud Data Wizard, Starfish ETL, Midas, and the like, can reduce this time, especially in the build phase.

In particular, understanding the source data – the most crucial task in any migration project – cannot be automated by tools, but requires analysts to take time going through the list of fields one by one.

The simplest estimate of the overall effort is one man-day for every field transferred from the legacy system.

An exception is data replication between the same source and target schemas without further transformation – sometimes known as 1:1 migration – where we can base the estimate on the number of tables to copy.

A detailed estimate is an art of its own.

3. Check Data Quality

Do not overestimate the quality of source data, even if no data quality issues are reported from the legacy systems.

New systems have new rules, which may be violated with legacy data. Here’s a simple example. Contact email can be mandatory in the new system, but a 20-year-old legacy system may have a different point of view.

There can be mines hidden in historical data that have not been touched for a long time but could activate when transferring to the new system. For example, old data using European currencies that no longer exist needs to be converted to Euros; otherwise, those currencies must be added to the new system.

Data quality significantly influences effort, and the simple rule is: The further we go in history, the bigger mess we will discover. Thus, it is vital to decide early on how much history we want to transfer into the new system.

4. Engage Business People

Business people are the only ones who truly understand the data and who can therefore decide what data can be thrown away and what data to keep.

It is important to have somebody from the business team involved during the mapping exercise, and for future backtracking, it is useful to record mapping decisions and the reasons for them.

Since a picture is worth more than a thousand words, load a test batch into the new system, and let the business team play with it.

Even if data migration mapping is reviewed and approved by the business team, surprises can appear once the data shows up in the new system’s UI.

“Oh, now I see, we have to change it a bit,” becomes a common phrase.

Failing to engage subject matter experts, who are usually very busy people, is the most common cause of problems after a new system goes live.

5. Aim for Automated Migration Solution

Data migration is often viewed as a one-time activity, and developers tend to end up with solutions full of manual actions hoping to execute them only once. But there are many reasons to avoid such an approach.

  • If migration is split into multiple waves, we have to repeat the same actions multiple times.
  • Typically, there are at least three migration runs for every wave: a dry run to test the performance and functionality of data migration, a full data validation load to test the entire data set and to perform business tests, and of course, production load. The number of runs increases with poor data quality. Improving data quality is an iterative process, so we need several iterations to reach the desired success ratio.

Thus, even if migration is a one-time activity by nature, having manual actions can significantly slow down your operations.

Salesforce Data Migration

Next we will cover five tips for a successful Salesforce migration. Keep in mind, these tips are likely applicable to other cloud solutions as well.

6. Prepare for Lengthy Loads

Performance is one of the biggest tradeoffs, if not the biggest, when moving from an on-premise to a cloud solution – Salesforce not excluded.

On-premise systems usually allow for direct data load into an underlying database, and with good hardware, we can easily reach millions of records per hour.

But, not in the cloud. In the cloud, we are heavily limited by several factors.

  • Network latency – Data is transferred via the internet.
  • Salesforce application layer – Data is moved through a thick API multitenancy layer until it lands in Salesforce's underlying Oracle databases.
  • Custom code in Salesforce – Custom validations, triggers, workflows, duplication detection rules, and so on – many of which disable parallel or bulk loads.

As a result, load performance can be thousands of accounts per hour.

It can be less, or it can be more, depending on things such as the number of fields, validations, and triggers. But it is several orders of magnitude slower than a direct database load.

Performance degradation, which is dependent on the volume of the data in Salesforce, must also be considered.

This degradation is caused by indexes in the underlying RDBMS (Oracle), which are used for checking foreign keys and unique fields and for evaluating duplication rules. The basic rule of thumb is approximately a 50 percent slowdown for every tenfold increase in data volume, due to the O(log N) time complexity of sort and B-tree operations.
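As a rough, back-of-the-envelope illustration of that rule of thumb (the baseline volume and the 1.5 factor are assumptions taken from the sentence above, not measurements):

import math

def relative_slowdown(records, baseline=100000):
    # Assumes roughly a 50 percent slowdown for every tenfold
    # increase in volume over the baseline, i.e. 1.5 ** log10(ratio).
    grades = math.log10(records / baseline)
    return 1.5 ** grades

print(relative_slowdown(1000000))    # ~1.5x slower than at 100,000 records
print(relative_slowdown(10000000))   # ~2.25x slower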

Moreover, Salesforce has many resource usage limits.

One of them is the Bulk API limit, set to 5,000 batches in a 24-hour rolling window, with a maximum of 10,000 records in each batch.

So, the theoretical maximum is 50 million records loaded in 24 hours.

In a real project, the maximum is much lower due to limited batch size when using, for example, custom triggers.

This has a strong impact on the data migration approach.

Even for medium-sized datasets (from 100,000 to 1 million accounts), the big bang approach is out of the question, so we must split data into smaller migration waves.

This, of course, impacts the entire deployment process and increases the migration complexity because we will be adding data increments into a system already populated by previous migrations and data entered by users.

We must also consider this existing data in the migration transformations and validations.

Further, lengthy loads can mean we cannot perform migrations during a system outage.

If all users are located in one country, we can leverage an eight-hour outage during the night.

But for a company, such as Coca-Cola, with operations all over the world, that is not possible. Once we have U.S., Japan, and Europe in the system, we span all time zones, so Saturday is the only outage option that doesn’t affect users.

And that may not be enough, so, we must load while online, when users are working with the system.

7. Respect Migration Needs in Application Development

Application components, such as validations and triggers, should be able to handle data migration activities. Hard disablement of validations at the time of the migration load is not an option if the system must be online. Instead, we have to implement different logic in validations for changes performed by a data migration user.

  • Date fields should not be compared to the actual system date because that would disable the loading of historical data. For example, validation must allow entering a past account start date for migrated data.
  • Mandatory fields, which may not be populated with historical data, must be implemented as non-mandatory, but with validation sensitive to the user, thus allowing empty values for data coming from the migration, but rejecting empty values coming from regular users via the GUI.
  • Triggers, especially those sending new records to the integration, must be able to be switched on/off for the data migration user in order to prevent flooding the integration with migrated data.

Another trick is adding a Legacy ID or Migration ID field to every migrated object. There are two reasons for this. The first is obvious: to keep the ID from the old system for backtracking; after the data is in the new system, people may still want to search for their accounts using the old IDs, found in places such as emails, documents, and bug-tracking systems. Bad habit? Maybe. But users will thank you if you preserve their old IDs. The second reason is technical and comes from the fact that Salesforce does not accept explicitly provided IDs for new records (unlike Microsoft Dynamics) but generates them during the load. The problem arises when we want to load child objects, because we have to assign them the IDs of their parent objects, and we will only know those IDs after the parents have been loaded.

Let’s use Accounts and their Contacts, for example:

  1. Generate data for Accounts.
  2. Load Accounts into Salesforce, and receive generated IDs.
  3. Incorporate new Account IDs in Contact data.
  4. Generate data for Contacts.
  5. Load Contacts in Salesforce.

We can do this more simply by loading Accounts with their Legacy IDs stored in a special external field. This field can be used as a parent reference, so when loading Contacts, we simply use the Account Legacy ID as a pointer to the parent Account:

  1. Generate data for Accounts, including Legacy ID.
  2. Generate data for Contacts, including Account Legacy ID.
  3. Load Accounts into Salesforce.
  4. Load Contacts in Salesforce, using Account Legacy ID as parent reference.

The nice thing here is that we have separated the generation and loading phases, which allows for better parallelism, decreased outage time, and so on.
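A minimal sketch of that separation, with plain Python dictionaries standing in for whatever staging format the real migration uses; the Legacy_ID__c and Account_Legacy_ID__c field names and the load functions are placeholders for illustration, not a prescribed Salesforce payload:

# Generation phase: build Accounts and Contacts offline, keyed by legacy IDs.
legacy_accounts = [
    {'Legacy_ID__c': 'ACC-001', 'Name': 'Acme Ltd.'},
    {'Legacy_ID__c': 'ACC-002', 'Name': 'Globex Corp.'},
]

legacy_contacts = [
    # The parent is referenced by the account's legacy ID, not by a
    # Salesforce-generated ID that we would only learn after loading.
    {'Legacy_ID__c': 'CON-001', 'LastName': 'Kenobi', 'Account_Legacy_ID__c': 'ACC-001'},
    {'Legacy_ID__c': 'CON-002', 'LastName': 'Skywalker', 'Account_Legacy_ID__c': 'ACC-002'},
]

def load_accounts(records):
    # Placeholder for the real load (Data Loader, Bulk API, an ETL tool, ...).
    print('loading %d accounts' % len(records))

def load_contacts(records):
    print('loading %d contacts' % len(records))

# Loading phase: both data sets were prepared up front, so Contacts never
# have to wait for Salesforce-generated Account IDs.
load_accounts(legacy_accounts)
load_contacts(legacy_contacts)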

8. Be Aware of Salesforce Specific Features

Like any system, Salesforce has plenty of tricky parts we should be aware of in order to avoid unpleasant surprises during data migration. Here are a handful of examples:

  • Some changes on active Users automatically generate email notifications to user emails. Thus, if we want to play with user data, we need to deactivate users first and activate after changes are completed. In test environments, we scramble user emails so that notifications are not fired at all. Since active users consume costly licenses, we are not able to have all users active in all test environments. We have to manage subsets of active users, for example, to activate just those in a training environment.
  • Inactive users, for some standard objects such as Account or Case, can be assigned only after granting the system permission “Update Records with Inactive Owners,” but they can be assigned, for example, to Contacts and all custom objects.
  • When Contact is deactivated, all opt out fields are silently turned on.
  • When loading a duplicate Account Team Member or Account Share object, the existing record is silently overwritten. However, when loading a duplicate Opportunity Partner, the record is simply added resulting in a duplicate.
  • System fields, such as Created Date, Created By ID, Last Modified Date, and Last Modified By ID, can be explicitly written only after granting a new system permission “Set Audit Fields upon Record Creation.”
  • The history of field value changes cannot be migrated at all.
  • Owners of knowledge articles cannot be specified during the load but can be updated later.
  • The tricky part is the storing of content (documents, attachments) into Salesforce. There are multiple ways to do it (using Attachments, Files, Feed attachments, Documents), and each way has its pros and cons, including different file size limits.
  • Picklist fields force users to select one of the allowed values, for example, a type of account. But when loading data using Salesforce API (or any tool built upon it, such as Apex Data Loader or Informatica Salesforce connector), any value will pass.

The list goes on, but the bottom line is: Get familiar with the system, and learn what it can do and what it cannot do before you make assumptions. Do not assume standard behavior, especially for core objects. Always research and test.

9. Do Not Use Salesforce as a Data Migration Platform

It is very tempting to use Salesforce as a platform for building a data migration solution, especially for Salesforce developers. It is the same technology for the data migration solution as for the Salesforce application customization, the same GUI, the same Apex programming language, the same infrastructure. Salesforce has objects which can act as tables, and a kind of SQL language, Salesforce Object Query Language (SOQL). However, please do not use it; it would be a fundamental architectural flaw.

Salesforce is an excellent SaaS application with a lot of nice features, such as advanced collaboration and rich customization, but mass processing of data is not one of them. The three most significant reasons are:

  • Performance – Processing of data in Salesforce is several orders of magnitude slower than in an RDBMS.
  • Lack of analytical features – Salesforce SOQL does not support complex queries or analytical functions; those gaps must be filled with Apex code, which degrades performance even more.
  • Architecture – Putting a data migration platform inside a specific Salesforce environment is not very convenient. Usually, we have multiple environments for specific purposes, often created ad hoc, so we can end up spending a lot of time on code synchronization. Plus, you would also be relying on the connectivity and availability of that specific Salesforce environment.

Instead, build the data migration solution in a separate instance (cloud or on-premise) using an RDBMS or ETL platform. Connect it with the source systems and with whichever target Salesforce environments you need, move the data into your staging area, and process it there. This will allow you to:

  • Leverage the full power and capabilities of the SQL language or ETL features.
  • Have all code and data in one place so that you can run analyses across all systems.
    • For example, you can combine the newest configuration from the most up-to-date test Salesforce environment with real data from the production Salesforce environment.
  • You are not so dependent upon the technology of the source and target systems and you can reuse your solution for the next project.

10. Oversee Salesforce Metadata

At the project's beginning, we usually grab a list of Salesforce fields and start the mapping exercise. During the project, it often happens that new fields are added into Salesforce by the application development team, or that some field properties are changed. We can ask the application team to notify the data migration team about every data model change, but that doesn't always work. To be safe, we need to keep all data model changes under supervision.

A common way to do this is to download, on a regular basis, migration-relevant metadata from Salesforce into some metadata repository. Once we have this, we can not only detect changes in the data model, but we can also compare data models of two Salesforce environments.

What metadata to download:

  • A list of objects with their labels and technical names and attributes such as creatable or updatable.
  • A list of fields with their attributes (better to get all of them).
  • A list of picklist values for picklist fields. We will need them to map or validate input data for correct values.
  • A list of validations, to make sure new validations are not creating problems for migrated data.

How to download metadata from Salesforce? Well, there is no standard metadata method, but there are multiple options:

  • Generate the Enterprise WSDL – In the Salesforce web application, navigate to the Setup / API menu and download the strongly typed Enterprise WSDL, which describes all the objects and fields in Salesforce (but not picklist values or validations).
  • Call the Salesforce describeSObjects web service, directly or by using a Java or C# wrapper (consult the Salesforce API). This way, you get exactly what you need, and it is the recommended way to export the metadata; an illustrative REST-based sketch follows this list.
  • Use any of the numerous alternative tools available on the internet.
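As one concrete possibility for the describe-based option above, the REST counterpart of that call can be queried with nothing more than the requests library; the instance URL, API version, and token below are placeholders, and the exact fields you keep will depend on your metadata repository:

import requests

INSTANCE = 'https://yourinstance.my.salesforce.com'   # placeholder
API_VERSION = 'v52.0'                                  # placeholder
TOKEN = 'your-session-or-oauth-access-token'           # placeholder

def describe_object(sobject):
    # Returns field-level metadata (names, types, picklist values) for one object.
    url = '{}/services/data/{}/sobjects/{}/describe'.format(INSTANCE, API_VERSION, sobject)
    response = requests.get(url, headers={'Authorization': 'Bearer ' + TOKEN})
    response.raise_for_status()
    return response.json()

account_meta = describe_object('Account')
for field in account_meta['fields']:
    print(field['name'], field['type'], field['updateable'])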

Prepare for the Next Data Migration

Cloud solutions, such as Salesforce, are ready instantly. If you are happy with the built-in functionalities, just log in and use it. However, Salesforce, like any other cloud CRM solution, brings specific data migration problems to be aware of, particularly regarding performance and resource limits.

Moving legacy data into a new system is always a journey, sometimes a journey into history hidden in data from past years. In this article, based on a dozen migration projects, I presented 10 tips on how to migrate legacy data and successfully avoid the most common pitfalls.

The key is to understand what the data reveals. So, before you start the data migration, make sure you are well prepared for the potential problems your data may hold.

This article was originally posted on Toptal