So here it is: my list of things not to do when doing Python web development.
Untrusted Data and File Systems
Unless you are running on a virtualized filesystem, as you are when executing code on Google App Engine, chances are that vital files can be accessed with the rights your application has. Very few deployments actually reduce the rights of the executing user account to a level where it would become safe to blindly trust user-submitted filenames. Because it typically isn't safe, you have to think about that. In PHP land this is common knowledge by now because many people write innocent-looking code like this:
<?php
include "header.php";
$page = isset($_GET['page']) ? $_GET['page'] : 'index';
$filename = $page . '.php';
if (file_exists($filename))
    include $filename;
else
    include "missing_page.php";
include "footer.php";
Python programmers apparently don't care too much about this problem because Python's file-opening functions don't share it and reading files from the filesystem is a very uncommon thing to do anyway. However, in the few situations where people do work with filenames, you will almost always find code like this:
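import os
import shutil

def upload_file(file):
    # Vulnerable: file.filename comes straight from the client.
    destination_file = os.path.join(UPLOAD_FOLDER, file.filename)
    with open(destination_file, 'wb') as f:
        shutil.copyfileobj(file, f)
The problem is that os.path.join does nothing to keep the result inside the upload folder: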
>>> import os
>>> os.path.join('/var/www/uploads', '../foo')
'/var/www/uploads/../foo'
>>> os.path.join('/var/www/uploads', '/foo')
'/foo'
So yes, os.path.join is totally not safe to use in a web context. Various libraries have ways that help you deal with this problem. Werkzeug, for instance, has a function called secure_filename that strips path separators from the filename and even removes non-ASCII characters, as character sets and filesystems are immensely tricky. At the very least you should do this:
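import os, re

# Matches NUL bytes and every path separator of the current platform.
_split = re.compile(r'[\0%s]' % re.escape(''.join(
    [os.path.sep, os.path.altsep or ''])))

def secure_filename(path):
    return _split.sub('', path)
This also strips NUL bytes, because Python's open refuses them: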
>>> open('\0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
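Putting the pieces together, here is a minimal sketch of an upload handler that sanitizes the name before joining it. It assumes Python 3 and a file-like object for the upload; save_upload and its parameters are names made up for this example, and rejecting names that consist only of dots is an extra precaution I added, not something the helper above does.

```python
import os
import re
import shutil

# Same idea as the secure_filename helper above: drop NUL bytes and separators.
_split = re.compile(r'[\0%s]' % re.escape(''.join(
    [os.path.sep, os.path.altsep or ''])))


def secure_filename(path):
    return _split.sub('', path)


def save_upload(stream, client_filename, upload_folder):
    """Store stream in upload_folder under a sanitized file name (hypothetical helper)."""
    name = secure_filename(client_filename)
    # Names that are empty or only dots ('', '.', '..') are not usable file names.
    if name.strip('.') == '':
        raise ValueError('unusable file name: %r' % client_filename)
    destination = os.path.join(upload_folder, name)
    with open(destination, 'wb') as f:
        shutil.copyfileobj(stream, f)
    return destination
```

With this, a submitted name such as ../../etc/passwd ends up as a single oddly named file inside the upload folder instead of pointing somewhere else.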
If you actually do want to allow slashes in the filename there are a couple of things you have to consider. On POSIX the whole system is incredibly easy: if the normalized path starts with a leading slash or with ../ it will or might reference a file outside of the folder you want the file to be in. That's easy to prevent:
import posixpath

def is_secure_path(path):
    path = posixpath.normpath(path)
    # normpath leaves a bare '..' as it is, so check for it explicitly
    return path != '..' and not path.startswith(('/', '../'))
The following function checks that a path cannot escape a folder on both POSIX and Windows:
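import os
import posixpath

# Separators other than '/' (the backslash on Windows)
non_slash_seps = [sep for sep in (os.path.sep, os.path.altsep)
                  if sep not in (None, '/')]

def is_in_folder(filename):
    # Normalize with posixpath so forward slashes are not rewritten on Windows
    filename = posixpath.normpath(filename)
    for sep in non_slash_seps:
        if sep in filename:
            return False
    return not (os.path.isabs(filename) or filename.startswith('/')
                or filename == '..' or filename.startswith('../'))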
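A different way to get a similar guarantee is to resolve the final path and compare it with the folder, instead of inspecting the string. The following is only a sketch of that alternative, assuming Python 3; resolve_inside is a name I made up for the example. Note that it follows symlinks, so a symlink inside the folder that points elsewhere is treated as outside.

```python
import os


def resolve_inside(folder, untrusted_path):
    """Return the absolute path of untrusted_path inside folder, or None."""
    if '\0' in untrusted_path:
        return None
    base = os.path.realpath(folder)
    # realpath collapses '..' components and follows symlinks.
    target = os.path.realpath(os.path.join(base, untrusted_path))
    try:
        inside = os.path.commonpath([base, target]) == base
    except ValueError:
        # Raised for example on Windows when the paths are on different drives.
        inside = False
    return target if inside else None
```

An empty path or '.' resolves to the folder itself, so callers that expect a file still have to check for that.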
Generally speaking though, if you do aim for Windows compatibility you have to be extra careful because Windows has its special device files in every folder on the filesystem for DOS compatibility. Writing to those might be problematic and could be abused for denial-of-service attacks.
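A rough sketch of how such names could be rejected, assuming the commonly documented set of reserved DOS device names (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9). The helper name is hypothetical and the list may not be exhaustive for every Windows version.

```python
# Reserved DOS device names on Windows (matched case-insensitively).
WINDOWS_DEVICE_NAMES = frozenset(
    ['CON', 'PRN', 'AUX', 'NUL']
    + ['COM%d' % i for i in range(1, 10)]
    + ['LPT%d' % i for i in range(1, 10)]
)


def is_windows_device_name(filename):
    """True if the name, ignoring any extension, is a reserved device name."""
    # Windows ignores the extension, so 'NUL.txt' still refers to the device.
    stem = filename.split('.', 1)[0]
    return stem.strip().upper() in WINDOWS_DEVICE_NAMES
```

A check like this belongs after the separators have been stripped, so that it sees the bare file name.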
Mixing up Data with Markup
This is a topic that always makes me cringe inside. I know it's very common and many don't see the issue with it, but it's the root of a whole bunch of problems and unmaintainable code. Let's say you have some data. That data for all practical purposes will be a string of some arbitrary maximum length and that string will be of a certain format. Let's say it's prose text and we want to preserve newlines but collapse all other whitespace to a single space. A very common pattern.
However that data is usually displayed on a website in the context of HTML, so someone will surely bring up the great idea to escape the input text and convert newlines to <br> before feeding the data into the database. Don't do this!
There are a bunch of reasons for this but the most important one is called “context”. Web applications these days are getting more and more complex, mainly due to the concept of APIs. A lot of the functionality of the website that was previously only available in an HTML form is now also available as RESTful interfaces speaking some other format such as JSON.
The context of a rendered text in your web application will most likely be “HTML”. In that context, <br> makes a lot of sense. But what if your transport format is JSON and the client on the other side is not (directly) rendering into HTML? This is the case for Twitter clients for instance. Yet someone at Twitter decided that the string with the application name that is attached to each tweet should be in HTML. When I wrote my first JavaScript client for that API I was parsing that HTML with jQuery and fetching the application name as a string because I was only interested in that. Annoying. Even worse: someone found out a while later that this particular field could actually be used to emit arbitrary HTML. A major security disaster.
The other problem is if you have to reverse the stuff again. If you want to be able to edit that text again you would have to unescape it, reproduce the original newlines etc.
So there should be a very, very simple rule (and it's actually really simple): store the data as it comes in. Don't flip a single bit! (The only acceptable conversion before storing stuff in the database might be Unicode normalization.)
When you have to display your stored information: provide a function that does that for you. If you fear that this could become a bottleneck: memcache it or have a second column in your database with the rendered information if you absolutely must. But never, ever let the HTML formatted version be the only thing you have in your database. And certainly never expose HTML strings over your API if all you want to do is to transmit text.
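To make the rule concrete, here is a minimal sketch of rendering at display time, assuming Python 3. The function names are made up for the example; the whitespace handling follows the prose-text format described earlier (keep newlines, collapse other whitespace).

```python
import html
import json
import re

# Any run of whitespace that does not contain a newline.
_inline_space = re.compile(r'[^\S\n]+')


def render_html(stored_text):
    """Turn the raw stored text into an HTML fragment at display time."""
    collapsed = _inline_space.sub(' ', stored_text)
    escaped = html.escape(collapsed)
    return escaped.replace('\n', '<br>\n')


def render_json(stored_text):
    """The API context gets the untouched text; json does its own escaping."""
    return json.dumps({'text': stored_text})
```

Because the database only ever holds the raw text, the edit form can show it again without any unescaping, and each output context applies its own escaping.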
Every time I get a notification on my mobile phone from a certain notification service where the message contains an umlaut, the information arrives completely broken. It turns out that one service assumes that HTML-escaped information is to be transmitted, but the other service only allows a few HTML-escaped characters and completely freaks out when you substitute “ä” with “&auml;”. If you are ever in the situation where you have to think about “is this plain text that is HTML-escaped or just plain text”, you are in deep trouble already.
Spending too much Time with the Choice of Framework
This should probably go to the top. If you have a small application (say, less than 10,000 lines of code) the framework probably isn't your problem anyway. And if you have more code than that, it's still not that hard to switch systems when you really have to. In fact, even switching out core components like an ORM is possible and achievable if you write a little shim and get rid of the old component step by step. Better to spend your time making the system better. The framework choice used to be a lot harder when the systems were incompatible, but this is clearly no longer the case. In fact, combine this with the next topic.
Building Monolithic Systems
We are living in an agile world. Some systems become deprecated before they are even finished :) In such an agile world new technologies are introduced at such a high speed that your favorite platform might not support them yet. As web developers we have the huge advantage that we have a nice protocol to separate systems: it's called HTTP and it is the base of all we do. Why not leverage that even further? Write small services that speak HTTP and bridge them together with another application. If that does not scale, put a load balancer between individual components. This has the nice side effect that each part of the system can be implemented in a different language. If Python does not have the library you need or does not have the performance, write a part of the system in Ruby, Java or whatever comes to mind.
But don't forget to still make it easy to deploy that system and put another machine in. If you end up with ten different programming languages with different runtime environments you are quickly making the life of your system administrator hell.
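The "small services that speak HTTP" idea can be sketched with nothing but the standard library, assuming Python 3. The word-count service, its URL layout and all names here are invented for illustration; a real setup would use a proper WSGI server and the service could just as well be written in another language.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode, urlsplit


class WordCountService(BaseHTTPRequestHandler):
    """Hypothetical backend service: GET /?text=... returns a word count as JSON."""

    def do_GET(self):
        query = parse_qs(urlsplit(self.path).query)
        text = query.get('text', [''])[0]
        body = json.dumps({'words': len(text.split())}).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_service():
    """Run the service on a free local port in a background thread."""
    server = HTTPServer(('127.0.0.1', 0), WordCountService)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def count_words(host, port, text):
    """What the bridging application does: call the service over HTTP."""
    conn = HTTPConnection(host, port, timeout=5)
    try:
        conn.request('GET', '/?' + urlencode({'text': text}))
        return json.loads(conn.getresponse().read())['words']
    finally:
        conn.close()
```

The bridging application only knows the host, the port and the JSON format, so the service behind it can be replaced or put behind a load balancer without touching the caller.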
Stolen from: http://lucumr.pocoo.org/2010/12/24/common-mistakes-as-web-developer/