The following post is a guest post by Chris Bunch, a Computer Science Ph.D. student at the University of California, Santa Barbara. He is one of the student leads on the AppScale project, an open source Google App Engine compatible hosting solution led by Professor Chandra Krintz. Chris has developed and maintained AppScale as a research project over the last two years with fellow student lead Navraj Chohan and others.
---------
Over here at the UCSB Racelab, we've complained endlessly about finding a web framework we actually could use. For a long time we thought we just wouldn't be able to find it - many were so-so or good but only after a substantial learning curve. So imagine our surprise back in April 2008 when we heard about what we thought would be just-another-web-framework provided by Google in the Python version of App Engine. But after giving it a try, we were smitten. We finally found a web framework that (1) we could actually use on non-trivial projects and (2) we could teach in nine-week classes without having students lose half the time with the idiosyncrasies of the programming language involved or the web framework itself. Furthermore, the minimalistic APIs make it simple to get work done: it did for us exactly what we needed and nothing else.
Yet as researchers and hackers-at-heart there was one thing that we really wanted to do with App Engine that we couldn't do: run it on a whole bunch of our machines and tinker with it. A similarly-minded hacker named Chris Anderson had released AppDrop, which was a modified version of the App Engine SDK that hooked up to PostgreSQL and ran in Amazon EC2, but only on a single machine. So after much discussion, we came up with the following short list of things we wanted to do with App Engine:
- We wanted to run it on our own virtual machines or those running in Eucalyptus or Amazon EC2 in order to investigate how we can optimally harness cloud infrastructures in our cloud platform.
- Tons of new datastores have emerged as part of the "NoSQL" movement, and we needed a mechanism to evaluate their performance, alongside that of traditional databases such as MySQL, under controlled experiments. We also needed a platform that supports adding new data storage mechanisms, so that when developers tout the features of their new datastore, we can download it and evaluate it under the same circumstances as other datastores.
- One of the reasons we love Google App Engine is the simple set of APIs provided, but we also wanted to use that as a starting point where we could add new APIs and control the environment in which they run.
- We love that Google App Engine "just works". You don't know where it's running or how it's running, but you can see that it is running, and we wanted whatever we developed to do the same. We wanted to develop something that automatically deployed your App Engine app and configured everything for you. Expert users should be able to have more control over the system, but the system should be able to handle your app from the moment you deploy it to the moment you tear it down.
- It had to be open-source - just like how we wanted something to tinker with and run experiments on, we wanted it to be something that you could tinker with too. We wanted you to be able to add in support for a database you're interested in and see how it performs, and we wanted you to be able to add in APIs that you think would be interesting to have an easy-to-use web framework interact with.
So with that in mind, we created AppScale, an open-source cloud platform for Google App Engine applications. Here's how we did it:
We took the standard three-tier web deployment approach and clearly segmented each tier into a specific component in the system: an AppLoadBalancer routes users to their applications, an AppServer runs the user's App Engine app, and an AppDB handles database interactions. Each has a clearly defined role in the system and is controlled by an AppController, a daemon that runs on each machine, monitors each component, and controls the specific order in which services are started. It writes all the configuration files for each service and coordinates services with the other AppControllers in the deployment. For those interested, we detail the specifics on the original AppScale implementation in this paper.
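To make the "specific order in which services are started" idea a bit more concrete, here is a minimal Python sketch of how a controller might derive a start order from declared dependencies. This is not AppScale's actual code: the `Service` class, the `start_order` function, and the dependency edges between the three components (database first, then app server, then load balancer) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """One component managed by a controller (hypothetical model)."""
    name: str
    depends_on: list = field(default_factory=list)


def start_order(services):
    """Return service names so each one appears after its dependencies."""
    by_name = {s.name: s for s in services}
    ordered, visiting, done = [], set(), set()

    def visit(name):
        if name in done:
            return
        if name in visiting:
            raise ValueError(f"dependency cycle at {name}")
        visiting.add(name)
        for dep in by_name[name].depends_on:
            visit(dep)
        visiting.discard(name)
        done.add(name)
        ordered.append(name)

    for service in services:
        visit(service.name)
    return ordered


# Assumed dependencies: the load balancer needs an app server to route to,
# and the app server needs a database to talk to.
SERVICES = [
    Service("AppLoadBalancer", depends_on=["AppServer"]),
    Service("AppServer", depends_on=["AppDB"]),
    Service("AppDB"),
]

if __name__ == "__main__":
    print(start_order(SERVICES))
```

Under these assumed dependencies, the sketch starts AppDB first and AppLoadBalancer last, and it refuses to produce an order at all if the declared dependencies form a cycle.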
We also wanted to embody the principle of "standing on the shoulders of giants", and as such, we employ open-source software as often as possible, where appropriate. Our AppLoadBalancer employs the nginx web server as well as the haproxy load balancer to ensure high performance. Our Memcache API implementation uses memcache under the hood, while our MapReduce API uses Apache Hadoop, which we added to give App Engine users running over AppScale the ability to run Hadoop MapReduce jobs from within their web applications.
Because we were able to keep the database support abstracted away from the other components in the system, we were able to add support for nine different data storage solutions within AppScale: HBase, Hypertable, MySQL, Cassandra, Voldemort, MongoDB, MemcacheDB, Scalaris, and SimpleDB. Many of these databases have seen interest in recent years, but they vary greatly and have been hard to measure under comparable conditions. To give a few examples, they vary in the query languages they provide, their topologies (e.g., master / slave, peer-to-peer), data consistency policies, and end-user library interfaces. This has made it non-trivial for the community to objectively determine scenarios in which one database performs better or worse than another and investigate why, but under AppScale, deploying all these databases is done automatically with no interaction from the user. And because AppScale is open-source, if a developer doesn't like the particular interface we use for a database, they can improve on it and give back to the community. We've used AppScale internally to evaluate the performance of Google App Engine applications on these datastores, and we've also developed an App Engine app, Active Cloud DB, that exposes a RESTful API that developers can use to access these datastores from any programming language or web framework.
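For readers wondering what "abstracted away" can look like in practice, here is a minimal Python sketch of a pluggable datastore layer. It is an illustration only, not AppScale's real interface: `DatastoreBackend`, `InMemoryBackend`, `register_backend`, and `create_backend` are hypothetical names, and the in-memory backend merely stands in for a real database driver.

```python
from abc import ABC, abstractmethod


class DatastoreBackend(ABC):
    """Hypothetical interface that every datastore plugin would implement."""

    @abstractmethod
    def put(self, table, key, value):
        """Store value under key in table."""

    @abstractmethod
    def get(self, table, key):
        """Return the stored value, or None if the key is absent."""

    @abstractmethod
    def delete(self, table, key):
        """Remove the key if present."""


class InMemoryBackend(DatastoreBackend):
    """Stand-in backend; a real plugin would wrap a database client."""

    def __init__(self):
        self._tables = {}

    def put(self, table, key, value):
        self._tables.setdefault(table, {})[key] = value

    def get(self, table, key):
        return self._tables.get(table, {}).get(key)

    def delete(self, table, key):
        self._tables.get(table, {}).pop(key, None)


# Registry mapping a datastore name to a factory for its backend.
BACKENDS = {}


def register_backend(name, factory):
    BACKENDS[name] = factory


def create_backend(name):
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown datastore: {name}") from None


register_backend("in_memory", InMemoryBackend)

if __name__ == "__main__":
    db = create_backend("in_memory")
    db.put("users", "alice", {"age": 30})
    print(db.get("users", "alice"))
```

The point of a design like this is that the rest of the system only ever calls `put`, `get`, and `delete`, so adding support for another datastore means writing one new class and registering it, and the same workload can then be run against each backend for comparison.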
Finally, the most important lesson we learned was the value of incremental development. Our core development team fluctuates between two and three developers, so from the first meeting we had, we knew that our very first release couldn't support every App Engine API, nor could it run nine databases seamlessly. Therefore, we started off with support for the two BigTable clones, HBase and Hypertable, as well as support for just the Datastore API, the URL Fetch API, and the Users API within App Engine. From there, we learned what datastores people actually wanted to see support for as well as what APIs people wanted to use. We were also able to add APIs within App Engine apps deployed to AppScale to be able to run virtual machines under the EC2 API, while also running computation under the MapReduce API.
But developing AppScale was certainly not a cakewalk for us. Over the course of the last two years, five major issues (some technical and some not) have arisen within the project:
- Writing software that works without knowing ahead of time how many machines will be in the system proved initially to be difficult to grasp, but in many cases we were able to reduce the number of variations that could occur and use that to provide some predictability with respect to how we configure and deploy databases and applications.
- We couldn't assume that the AppScale administrator has access to DNS; without it, a number of APIs and features are extremely difficult to implement. Load balancing is much more difficult, and many APIs that are tied to host names must be tied to one machine in the system, else they don't work properly. VLAN tagging shows some promise to alleviate these problems, but right now is far from being deployed inexpensively and easily.
- The source code for the Java version of App Engine isn't publicly available, so we had to spend a lot of time decompiling the SDK, modifying it to use our database and our API implementations instead of the SDK implementations, and recompiling it. All of these were non-trivial and greatly added to the time it took for us to deploy a version of AppScale with Java App Engine support.
- Not all users want a pre-built virtual machine image, so ensuring that building the AppScale environment was done right every time was a top priority. We had to limit ourselves to Ubuntu Jaunty for many releases, and only recently were we able to expand to include Karmic and Lucid, which still make up a microcosm of the distributions available in the Linux world. Adding the ability to install AppScale via apt-get in these specific Linux distributions has also been a crucial step in making sure that users could easily and quickly install AppScale for use.
- Both undergraduates and graduate students here at UCSB have done projects involving AppScale, which means that the number and experience levels of developers working on AppScale are completely unpredictable at any given moment. Oftentimes the projects they work on are only tangentially related to features that users want, and the time scales on which they are available to work are vastly different from what most software engineers are used to.
All of these problems are greatly exacerbated by only having a two-to-three person core developer team, but this also makes the AppScale project particularly interesting to work on. Despite having worked on AppScale for two years, there are still tons of interesting problems to work on and we still love the Python App Engine web framework as much as we did when we first picked it up. And of course, AppScale is open-source, under the New BSD License, so feel free to download it and tinker around like we have! Check out AppScale at:
http://appscale.cs.ucsb.edu
http://code.google.com/p/appscale