HTTPS Certificate Verification in Python With urllib2

Posted on 08 January 2012 by Joseph

This post is a duplicate of one on my former site, muchtooscrawled.com. That site is no more, and this is the only post of any real quality, so I thought I would copy it over.

Everyone loves Python. I particularly feel encased in Python’s womb-like warmth and comfort when I am trying to do client-side communication with web servers or web services. Most of the magic has already been accomplished by the time I type import urllib2 – super simple and clean interfaces that seem to go increasingly deep as you need them. Request a page with a single line, do a GET or POST request with two lines, modify headers as needed, do secure communication with SSL; all of these things are simple and elegant, adding complexity only when needed for more complex goals.

Recently, I found a hole in this seemingly infinitely deep well of value added by urllib2. While the module will happily do SSL-secured communication for you, it fails to provide any easy way to verify server certificates. This is a critical feature, especially when using web services. For instance, if I wanted to use a service to version-check files on my system with files on a central server, allowing me to download the updates as needed, communicating with an unverified server could be disastrous. After poking around a bit online, I still hadn’t found anything useful in the urllib2 interface to help me accomplish this, so I started opening up the library files themselves. My goal was to use SSL with cert verification while still leveraging urllib2 for all of my high-level interface needs.

It turns out that this isn’t very difficult, even though the interfaces do not make this kind of extension as easy as it could be. The ssl module already includes certificate verification, although you must supply your own trusted root certificates. These are easy to find, as it is in the interest of CAs like Verisign and Thawte to publish them (for instance, your browser already has copies that it uses for certificate verification). The question, then, is how to supply the appropriate parameters to the ssl.wrap_socket(...) function.

The answer, in this case, is to subclass the httplib.HTTPSConnection class and pass in the appropriate data. Here is an example:

import httplib
import socket
import ssl

class VerifiedHTTPSConnection(httplib.HTTPSConnection):
    def connect(self):
        # overrides the version in httplib so that we do
        #    certificate verification
        sock = socket.create_connection((self.host, self.port), self.timeout)
        if self._tunnel_host:
            self.sock = sock
            self._tunnel()
        # wrap the socket using verification with the root
        #    certs in the file trusted_root_certs; note that this
        #    validates the certificate chain only and does not check
        #    that the certificate was issued for self.host
        self.sock = ssl.wrap_socket(sock,
                                    self.key_file,
                                    self.cert_file,
                                    cert_reqs=ssl.CERT_REQUIRED,
                                    ca_certs="trusted_root_certs")

The key is the two extra parameters, cert_reqs and ca_certs, in the call to wrap_socket. For a more complete discussion of the meaning of these parameters, please refer to the documentation.
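On interpreters that ship ssl.SSLContext (Python 2.7.9 and later, and all of Python 3), the same two settings can be carried by a context object instead of keyword arguments to wrap_socket. The following is a minimal sketch rather than a drop-in replacement for the class above; the helper name make_verifying_context is made up for illustration, and passing no file falls back to the system's default trust store.

```python
import ssl

def make_verifying_context(ca_file=None):
    """Build a client-side context that requires a valid server certificate.

    ca_file is a hypothetical path to a PEM bundle of trusted roots
    (the role played by ca_certs above); None means "use system defaults".
    """
    context = ssl.create_default_context(cafile=ca_file)
    # create_default_context already sets both of these; they are repeated
    # here only to make the equivalent of cert_reqs=ssl.CERT_REQUIRED visible.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context

context = make_verifying_context()
```

Unlike a bare wrap_socket call, a context with check_hostname enabled also checks that the certificate matches the host name passed as server_hostname to context.wrap_socket.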

The next step is integrating our new connection in such a way that allows us to use it with urllib2. This is done by installing a non-default HTTPS handler, by first subclassing the urllib2.HTTPSHandler class, then installing it as a handler in an OpenerDirector object using the urllib2.build_opener(...) function. Here is the example subclass:

import urllib2

# wraps https connections with ssl certificate verification
class VerifiedHTTPSHandler(urllib2.HTTPSHandler):
    def __init__(self, connection_class = VerifiedHTTPSConnection):
        self.specialized_conn_class = connection_class
        urllib2.HTTPSHandler.__init__(self)
    def https_open(self, req):
        return self.do_open(self.specialized_conn_class, req)

As you can see, I have added the connection class as a parameter to the constructor. Because of the way the handler classes are used, it would require substantially more work to be able to pass in the value of the ca_certs parameter to wrap_socket. Instead, you can just create different subclasses for different root certificate sets. This would be useful if you had a development server with a self-signed certificate and a production server with a CA-signed certificate, as you could swap them out at runtime or delivery time using the parameter to the constructor above.
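One way to stamp out those per-certificate-set subclasses without writing each by hand is a small class factory. This is only a sketch under an assumption: it presumes the connection class reads its root certificate file from a class attribute named ca_certs rather than the hard-coded string used above. The function with_root_certs and the demo class are invented for illustration.

```python
def with_root_certs(base_class, ca_certs, name=None):
    """Return a subclass of base_class whose ca_certs attribute is ca_certs.

    Assumes base_class.connect() looks up self.ca_certs when it wraps
    the socket; that is a change from the hard-coded file name above.
    """
    if name is None:
        name = base_class.__name__ + "WithRoots"
    return type(name, (base_class,), {"ca_certs": ca_certs})


class DemoConnection(object):
    """Stand-in for a connection class, used only to show the pattern."""
    ca_certs = "trusted_root_certs"


# Hypothetical file names for a development and a production root set.
DevConnection = with_root_certs(DemoConnection, "dev_root_certs", "DevConnection")
ProdConnection = with_root_certs(DemoConnection, "prod_root_certs", "ProdConnection")
```

Either generated class could then be handed to the handler's constructor as connection_class.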

With this class, you can either create an OpenerDirector object, or you can install a handler into urllib2 itself for use in the urlopen(...) function. Here is how to create the opener and use it to open a secure site with certificate verification:

https_handler = VerifiedHTTPSHandler()
url_opener = urllib2.build_opener(https_handler)
handle = url_opener.open('https://www.example.com')
response = handle.readlines()
handle.close()

If the certificate for example.com is not signed by one of the trusted authority keys in the file trusted_root_certs (from the VerifiedHTTPSConnection class), then the call to url_opener.open(...) will raise a urllib2.URLError exception with some debugging-type information from the ssl module. Otherwise, urllib2 functions just as normal, albeit now communicating with a trusted source.
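Calling code will usually want to catch that exception rather than let it escape. Here is a minimal sketch of one way to do so; the helper name fetch_lines and its return convention are assumptions for illustration, and the import fallback is there only because URLError lives in urllib.error on Python 3.

```python
try:
    from urllib2 import URLError       # Python 2, as used in this post
except ImportError:
    from urllib.error import URLError  # Python 3 home of the same class


def fetch_lines(opener, url):
    """Return (lines, None) on success or (None, reason) on a URLError.

    opener is any object with an open(url) method, such as the
    OpenerDirector returned by build_opener.
    """
    try:
        handle = opener.open(url)
    except URLError as err:
        # err.reason carries the underlying error, e.g. the ssl failure
        return None, err.reason
    try:
        return handle.readlines(), None
    finally:
        handle.close()
```

With the opener built above, fetch_lines(url_opener, 'https://www.example.com') would hand back the failure reason instead of raising when verification fails.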

The Failure of Windows Update

Posted on 07 January 2012 by Joseph

Yesterday, I returned to my computer, left on to gather data over the course of several power management state changes, only to discover that Windows Update had automatically rebooted my machine to complete the installation of some critical, unnamed update. My data collection was truncated and needed to be restarted.

As always when this happens, I was infuriated, but I took a bit of time to think about it more carefully. I realized that fundamentally this is a disconnect that has arisen as a result of improved power management. In the simpler times of PCs-instead-of-laptops and poor system support for power management, an automatic update would most likely happen when nothing else was going on. The systems were mostly powered on, and updating in the middle of the night was no big deal.

Today, though, I (and probably most people) suspend my laptop whenever I am not using it by closing the lid. This forces the machine to update when I am actually using my computer, a time when an update and reboot is rarely convenient. As a result, I often postpone the updates. So when can they happen without the pesky user (me) interrupting them? Only when my computer is on but not actively being used, which is more or less by definition when I have some long-running automatic task going, such as my data collection yesterday.

What is the solution to this? How can Windows update itself without interrupting my tasks and sending me into fits of trichotillomania? The simplest answer is to change the default. Instead of automatically updating and rebooting, automatically update then notify the user that a reboot needs to happen in order to complete the update. The package management functionality in Windows is already plenty capable of this type of deferred installation. Unfortunately, this option is currently not only NOT the default as of Windows 7, but is not an option at all. Instead we must choose between inopportune reboots or user-initiated updating. Windows needs to catch up to a time when power management works and is embraced by many (most?) users.

Embedding Python in C++ Applications with boost::python: Part 4

Posted on 06 January 2012 by Joseph

In Part 2 of this ongoing tutorial, I introduced code for parsing Python exceptions from C++. In Part 3, I implemented a simple configuration parsing class utilizing the Python ConfigParser module. As part of that implementation, I mentioned that for a project of any scale, one would want to catch and deal with Python exceptions within the class, so that clients of the class wouldn’t have to know about the details of Python. From the perspective of a caller, then, the class would be just like any other C++ class.

The obvious way of handling the Python exceptions would be to handle them in each function. For example, the get function of the C++ ConfigParser class we created would become:

std::string ConfigParser::get(const std::string &attr, const std::string &section){
    try{
        return py::extract<std::string>(conf_parser_.attr("get")(section, attr));
    }catch(boost::python::error_already_set const &){
        std::string perror_str = parse_python_exception();
        throw std::runtime_error("Error getting configuration option: " + perror_str);
    }
}

The error handling code remains the same, but now the main function becomes:

int main(){
    Py_Initialize();
    try{
        ConfigParser parser;
        parser.parse_file("conf_file.1.conf");
        ...
        // Will raise a NoOption exception
        cout << "Proxy host: " << parser.get("ProxyHost", "Network") << endl;
    }catch(exception &e){
        cout << "Here is the error, from a C++ exception: " << e.what() << endl;
    }
}

When the Python exception is raised, it will be parsed and repackaged as a std::runtime_error, which is caught at the caller and handled like a normal C++ exception (i.e. without having to go through the parse_python_exception rigmarole). For a project that only has a handful of functions or a class or two utilizing embedded Python, this will certainly work. For a larger project, though, one wants to avoid the large amount of duplicated code and the errors it will inevitably bring.

For my implementation, I wanted to always handle the errors in the same way, but I needed a way to call different functions with different signatures. I decided to leverage another powerful area of the boost library: the functors library, and specifically boost::bind and boost::function. boost::function provides functor class wrappers, and boost::bind (among other things) binds arguments to functions. The two together, then, enable the passing of functions and their arguments that can be called at a later time. Just what the doctor ordered!

To utilize the functor, the function needs to know about the return type. Since we're wrapping functions with different signatures, a function template does the trick nicely:

template <class return_type>
return_type call_python_func(boost::function<return_type ()> to_call, const std::string &error_pre){
    std::string error_str(error_pre);

    try{
        return to_call();
    }catch(boost::python::error_already_set const &){
        error_str = error_str + parse_python_exception();
        throw std::runtime_error(error_str);
    }
}

This function takes the functor object for a function calling boost::python functions. Each function that calls boost::python code will now be split into two functions: the private core function that uses the Python functionality and a public wrapper function that uses the call_python_func function. Here is the updated get function and its partner:

string ConfigParser::get(const string &attr, const string &section){
    return call_python_func<string>(boost::bind(&ConfigParser::get_py, this, attr, section),
                                    "Error getting configuration option: ");
}

string ConfigParser::get_py(const string &attr, const string &section){
    return py::extract<string>(conf_parser_.attr("get")(section, attr));
}

The get function binds the passed-in arguments, along with the implicit this pointer, to the get_py function, which in turn calls the boost::python functions necessary to perform the action. Simple and effective.
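For readers more at home in Python, the same bind-then-call shape can be sketched there too. This is an analogy only, not part of the tutorial's code: functools.partial stands in for boost::bind, a plain zero-argument callable stands in for boost::function, and LookupError stands in for error_already_set. All names and the sample settings dictionary are invented for illustration.

```python
import functools


def call_wrapped(to_call, error_pre):
    """Run a zero-argument callable, repackaging lookup failures."""
    try:
        return to_call()
    except LookupError as err:
        # One place to translate low-level errors for every caller
        raise RuntimeError(error_pre + repr(err))


def get_py(settings, attr, section):
    """Core function: does the real work and may raise KeyError."""
    return settings[section][attr]


def get(settings, attr, section="DEFAULT"):
    """Public wrapper: binds the arguments, then delegates error handling."""
    bound = functools.partial(get_py, settings, attr, section)
    return call_wrapped(bound, "Error getting configuration option: ")


settings = {"DEFAULT": {"Directory": "/var/data"}}
```

Here get(settings, "Directory") returns the stored value, while a lookup in a missing section surfaces as a RuntimeError carrying the shared prefix.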

Of course, there is a tradeoff associated here. Instead of the repeated code of the try...catch blocks and Python error handling, there are double the number of functions declared per class. For my purposes, I prefer the second form, as it more effectively utilizes the compiler to find errors, but mileage may vary. The most important point is to handle Python errors at a level of code that understands Python. If your entire application needs to understand Python, you should consider rewriting in Python rather than embedding, perhaps with some C++ modules as needed.

As always, you can follow along with the tutorial by cloning the github repo.

Embedding Python in C++ Applications with boost::python: Part 3

Posted on 05 January 2012 by Joseph

In Part 2 of this tutorial, I covered a methodology for handling exceptions thrown from embedded Python code from within the C++ part of your application. This is crucial for debugging your embedded Python code. In this tutorial, we will create a simple C++ class that leverages Python functionality to handle an often-irritating part of developing real applications: configuration parsing.

In an attempt to not draw ire from the C++ elites, I am going to say this in a diplomatic way: I suck at complex string manipulations in C++. STL strings and stringstreams greatly simplify the task, but performing application-level tasks, and performing them in a robust way, always results in me writing more code than I would really like. As a result, I recently rewrote the configuration parsing mechanism from Granola Connect (the daemon in Granola Enterprise that handles communication with the Granola REST API) using embedded Python and specifically the ConfigParser module.

Of course, string manipulations and configuration parsing are just an example. For Part 3, I could have chosen any number of tasks that are difficult in C++ and easy in Python (web connectivity, for instance), but the configuration parsing class is a simple yet complete example of embedding Python for something of actual use. Grab the code from the Github repo for this tutorial to play along.

First, let’s create a class definition that covers very basic configuration parsing: read and parse INI-style files, extract string values given a name and a section, and set string values for a given section. Here is the class declaration:

class ConfigParser{
    private:
        boost::python::object conf_parser_;

        void init();
    public:
        ConfigParser();

        bool parse_file(const std::string &filename);
        std::string get(const std::string &attr,
                        const std::string &section = "DEFAULT");
        void set(const std::string &attr,
                 const std::string &value,
                 const std::string &section = "DEFAULT");
};

The ConfigParser module offers far more features than we will cover in this tutorial, but the subset we implement here should serve as a template for implementing more complex functionality. The implementation of the class is fairly simple; first, the constructor loads the main module, extracts the dictionary, imports the ConfigParser module into the namespace, and creates a boost::python::object member variable holding a RawConfigParser object:

ConfigParser::ConfigParser(){
    py::object mm = py::import("__main__");
    py::object mn = mm.attr("__dict__");
    py::exec("import ConfigParser", mn);
    conf_parser_ = py::eval("ConfigParser.RawConfigParser()", mn);
}
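Seen from the Python side, the constructor's exec and eval calls amount to roughly the following. This is a sketch, not the embedded code itself: it uses a fresh dictionary instead of the dictionary of __main__ so that it does not touch the caller's globals, and the fallback import is an addition to cope with the module's Python 3 name, configparser.

```python
# Namespace that plays the role of the __main__ dictionary in the C++ code
namespace = {}

try:
    exec("import ConfigParser", namespace)   # Python 2 module name
except ImportError:
    # Python 3 renamed the module; alias it so the eval below is unchanged
    exec("import configparser as ConfigParser", namespace)

# Equivalent of py::eval("ConfigParser.RawConfigParser()", mn)
conf_parser = eval("ConfigParser.RawConfigParser()", namespace)
```

The object bound to conf_parser here corresponds to what the C++ class keeps in conf_parser_.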

The file parsing and the getting and setting of values are performed using this conf_parser_ object:

bool ConfigParser::parse_file(const std::string &filename){
    return py::len(conf_parser_.attr("read")(filename)) == 1;
}

std::string ConfigParser::get(const std::string &attr, const std::string &section){
    return py::extract<std::string>(conf_parser_.attr("get")(section, attr));
}

void ConfigParser::set(const std::string &attr, const std::string &value, const std::string &section){
    conf_parser_.attr("set")(section, attr, value);
}
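To see what those three attribute calls do on the Python side, here is a small standalone sketch that makes the same read, get, and set calls directly. The file contents and the demo function are invented for illustration, and the import fallback covers the module's Python 3 name.

```python
import os
import tempfile

try:
    import ConfigParser as configparser  # Python 2 name, as in this tutorial
except ImportError:
    import configparser                  # Python 3 name


def demo():
    # Write a throwaway INI file with made-up contents
    fd, path = tempfile.mkstemp(suffix=".conf")
    with os.fdopen(fd, "w") as conf_file:
        conf_file.write("[DEFAULT]\nDirectory = /var/data\n\n"
                        "[Auth]\nUsername = joe\n")
    try:
        parser = configparser.RawConfigParser()
        # read() returns the list of files it parsed successfully,
        # which is why parse_file compares its length to 1
        parsed_ok = len(parser.read(path)) == 1
        directory = parser.get("DEFAULT", "Directory")
        username = parser.get("Auth", "Username")
        # Python-side argument order is (section, option, value)
        parser.set("DEFAULT", "Directory", "values can be arbitrary strings")
        updated = parser.get("DEFAULT", "Directory")
    finally:
        os.remove(path)
    return parsed_ok, directory, username, updated
```

Note that the Python methods take the section first and the option second, which is why the C++ wrappers swap their arguments when forwarding.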

In this simple example, for the sake of brevity exceptions are allowed to propagate. In a more complex environment, you will almost certainly want to have the C++ class handle and repackage the Python exceptions as C++ exceptions. This way you could later create a pure C++ class if performance or some other concern became an issue.

To use the class, calling code can simply treat it as a normal C++ class:

int main(){
    Py_Initialize();
    try{
        ConfigParser parser;
        parser.parse_file("conf_file.1.conf");
        cout << "Directory (file 1): " << parser.get("Directory", "DEFAULT") << endl;
        parser.parse_file("conf_file.2.conf");
        cout << "Directory (file 2): " << parser.get("Directory", "DEFAULT") << endl;
        cout << "Username: " << parser.get("Username", "Auth") << endl;
        cout << "Password: " << parser.get("Password", "Auth") << endl;
        parser.set("Directory", "values can be arbitrary strings", "DEFAULT");
        cout << "Directory (force set by application): " << parser.get("Directory") << endl;
        // Will raise a NoOption exception 
        // cout << "Proxy host: " << parser.get("ProxyHost", "Network") << endl;
    }catch(boost::python::error_already_set const &){
        string perror_str = parse_python_exception();
        cout << "Error during configuration parsing: " << perror_str << endl;
    }
}

And that's that: a key-value configuration parser with sections and comments in under 50 lines of code. This is just the tip of the iceberg too. In almost the same length of code, you can do all sorts of things that would be at best painful and at worst error prone and time consuming in C++: configuration parsing, list and set operations, web connectivity, file format operations (think XML/JSON), and myriad other tasks are already implemented in the Python standard library.

In Part 4, I'll take a look at how to more robustly and generically call Python code using functors and a Python namespace class.

Embedding Python in C++ Applications with boost::python: Part 2

Posted on 04 January 2012 by Joseph

In Part 1, we took a look at embedding Python in C++ applications, including several ways of calling Python code from your application. Though I earlier promised a full implementation of a configuration parser in Part 2, I think it’s more constructive to take a look at error parsing. Once we have a good way to handle errors in Python code, I’ll create the promised configuration parser in Part 3. Let’s jump in!

If you got yourself a copy of the git repo for the tutorial and were playing around with it, you may have experienced the way boost::python handles Python errors – the error_already_set exception type. If not, the following code will generate the exception:

    namespace py = boost::python;
    ...
    Py_Initialize();
    ...
    py::object rand_mod = py::import("fake_module");

…which outputs the not-so-helpful:

terminate called after throwing an instance of 'boost::python::error_already_set'
Aborted

In short, any errors that occur in the Python code that boost::python handles will cause the library to raise this exception; unfortunately, the exception does not encapsulate any of the information about the error itself. To extract information about the error, we’re going to have to resort to using the Python C API and some Python itself. First, catch the error:

    try{
        Py_Initialize();
        py::object rand_mod = py::import("fake_module");
    }catch(boost::python::error_already_set const &){
        std::string perror_str = parse_python_exception();
        std::cout << "Error in Python: " << perror_str << std::endl;
    }

Above, we've called the parse_python_exception function to extract the error string and print it. As this suggests, the exception data is stored statically in the Python library and not encapsulated in the exception itself. The first step in the parse_python_exception function, then, is to extract that data using the PyErr_Fetch Python C API function:

std::string parse_python_exception(){
    PyObject *type_ptr = NULL, *value_ptr = NULL, *traceback_ptr = NULL;
    PyErr_Fetch(&type_ptr, &value_ptr, &traceback_ptr);
    std::string ret("Unfetchable Python error");
    ...

As there may be all, some, or none of the exception data available, we set up the returned string with a fallback value. Next, we try to extract and stringify the type data from the exception information:

    ...
    if(type_ptr != NULL){
        py::handle<> h_type(type_ptr);
        py::str type_pstr(h_type);
        py::extract<std::string> e_type_pstr(type_pstr);
        if(e_type_pstr.check())
            ret = e_type_pstr();
        else
            ret = "Unknown exception type";
    }
    ...

In this block, we first check that there is actually a valid pointer to the type data. If there is, we construct a boost::python::handle to the data from which we then create a str object. This conversion should ensure that a valid string extraction is possible, but to double check we create an extract object, check the object, and then perform the extraction if it is valid. Otherwise, we use a fallback string for the type information.

Next, we perform a very similar set of steps on the exception value:

    ...
    if(value_ptr != NULL){
        py::handle<> h_val(value_ptr);
        py::str a(h_val);
        py::extract<std::string> returned(a);
        if(returned.check())
            ret +=  ": " + returned();
        else
            ret += std::string(": Unparseable Python error: ");
    }
    ...

We append the value string to the existing error string. The value string is, for most built-in exception types, the readable string describing the error.

Finally, we extract the traceback data:

    if(traceback_ptr != NULL){
        py::handle<> h_tb(traceback_ptr);
        py::object tb(py::import("traceback"));
        py::object fmt_tb(tb.attr("format_tb"));
        py::object tb_list(fmt_tb(h_tb));
        py::object tb_str(py::str("\n").join(tb_list));
        py::extract<std::string> returned(tb_str);
        if(returned.check())
            ret += ": " + returned();
        else
            ret += std::string(": Unparseable Python traceback");
    }
    return ret;
}

The traceback goes similarly to the type and value extractions, except for the extra step of formatting the traceback object as a string. For that, we import the traceback module. From traceback, we then extract the format_tb function and call it with the handle to the traceback object. This generates a list of traceback strings which we then join into a single string. Not the prettiest printing, perhaps, but it gets the job done. Finally, we extract the C++ string type as above and append it to the returned error string and return the entire result.
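The three steps map closely onto what pure Python code would do with sys.exc_info(), which returns the same type, value, and traceback triple that PyErr_Fetch hands back. The following is a rough Python-side analogue, offered as a sketch for comparison rather than as part of the C++ implementation; the function name is made up.

```python
import sys
import traceback


def describe_current_exception():
    """Build one string from the exception currently being handled."""
    exc_type, exc_value, exc_tb = sys.exc_info()
    ret = "Unfetchable Python error"           # fallback, as in the C++ code
    if exc_type is not None:
        ret = str(exc_type)                    # stringify the type
    if exc_value is not None:
        ret += ": " + str(exc_value)           # readable error message
    if exc_tb is not None:
        # format_tb returns a list of strings, joined just like tb_list
        ret += ": " + "\n".join(traceback.format_tb(exc_tb))
    return ret


try:
    __import__("fake_module")
except ImportError:
    message = describe_current_exception()
```

The exact text of the type and message differs between Python versions, so the sketch makes no promise about the precise string produced.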

In the context of the earlier error, the application now generates the following output:

Error in Python: <type 'exceptions.ImportError'>: No module named fake_module

Generally speaking, this function will make it much easier to get to the root cause of problems in your embedded Python code. One caveat: if you are configuring a custom Python environment (especially module paths) for your embedded interpreter, the parse_python_exception function may itself throw a boost::python::error_already_set when it attempts to load the traceback module, so you may want to wrap the call to the function in a try...catch block and parse only the type and value pointers out of the result.

As I mentioned above, in Part 3 I will walk through the implementation of a configuration parser built on top of the ConfigParser Python module. Assuming, of course, that I don't get waylaid again.


Copyright © 2024 Joseph Turner