The cloud has become the new darling of the computing world, and Microsoft has once again given developers a compelling platform on which to build their next generation of applications. Enter SQL Azure, a cloud-based edition of Microsoft’s leading database product, Microsoft SQL Server. One might ask how to design an RDBMS database for the cloud, and specifically for the Azure platform. Creating a simple database in SQL Azure is easy enough, but database creation and database design are two different life forms. So how should databases be designed for the cloud and for SQL Azure? In this blog, I hope to provide an answer.
Scale Out
First and foremost, the cloud has become one of the primary solutions to consider when an application faces massive scalability demands. To put it another way, the cloud is both a catalyst for the emergence of extremely large applications and a result of the webification of nearly everything. Design considerations need to factor in previously unheard-of scales: bandwidth demands are global, disk space runs into the petabytes, and users number in the millions. When you build an application to handle this scale, the usual approaches any one of us is used to no longer work. One now needs to think massively in terms of design; no single piece of hardware can handle the kind of scale I just mentioned. So how does one design a simple database intended to scale to that degree?
The Problem
Suppose we have a great idea for the next big thing, and we name it FreeDiskSpace. FreeDiskSpace lets users upload massive amounts of files, where a file can be anything one can think of. The scale of this next big thing is projected to reach 500 million users scattered around the globe, with each user allowed to upload an unlimited number of files.
The Challenge
At this order of magnitude, the challenges we face will force us to question every bit of knowledge and development practice we acquired doing gigs with databases of a few gigabytes and a few thousand users. If each user uploads files totaling no less than 5GB, we would need to store 2,500 petabytes worth of files. How does one store 2,500 petabytes and manage the uploading and downloading of 500 million users?
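The storage figure above is simple arithmetic; here is a quick back-of-the-envelope sanity check, purely illustrative:

```python
# Back-of-the-envelope storage estimate for FreeDiskSpace.
users = 500_000_000        # projected user base
gb_per_user = 5            # minimum upload allowance per user, in GB

total_gb = users * gb_per_user
total_pb = total_gb / 1_000_000   # 1 petabyte = 1,000,000 gigabytes (decimal units)

print(total_pb)            # 2500.0 petabytes
```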
In the cloud, you are less likely to see a handful of supercomputers handling this kind of scale. Rather, you are more likely to see hundreds of thousands of commodity servers woven together into a cohesive unit. Making this mesh behave as if it were operating as a single unit requires design patterns appropriate for such environments.
Database Sharding in SQL Azure
First, what is database sharding? To put it simply, database sharding is a design pattern that splits data into shards, or fragments, where each shard can run on its own instance on a separate server located somewhere. Each shard has the same structure as the others but contains a different subset of the data.
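As a rough, language-agnostic illustration of the idea (not a SQL Azure feature), a shard router can be as simple as a function that maps a shard key to one of several identical databases. The connection strings below are invented placeholders:

```python
# Minimal hash-based shard router: maps a shard key (e.g. a username)
# to one of several structurally identical databases.
# Server/database names are hypothetical, not real SQL Azure endpoints.
import hashlib

SHARDS = [
    "Server=ABC;Database=DB1",
    "Server=CDE;Database=DB1",
]

def shard_for(key: str) -> str:
    # A stable hash ensures the same key always lands on the same shard.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("john"))   # deterministic: "john" always routes to the same shard
```

The application would use the returned connection string to open a connection and run its queries; all shards share one schema, so the same SQL works against any of them.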
The current version of SQL Azure offers NO feature for automatic scaling. This will surely change in the near future, as everybody building applications that handle massive loads will need an easy way to scale, and sharding databases is guaranteed to be one of the common approaches. For now, however, sharding in SQL Azure isn’t automatic: if you need to shard, you need to do it by hand. As I have said, Microsoft will eventually be compelled to give us an easy way to practice sharding. Hopefully, that future is near.
Now, why do we need to shard databases in SQL Azure? The cloud, and SQL Azure specifically, forces this design pattern on us through its limitations. First, SQL Azure allows a maximum database size of only 50GB on a virtual instance running on off-the-shelf commodity hardware, so we need a way to work around that constraint if we expect our data to grow to something like 500 terabytes. Second, if we foresee a huge number of users hitting our application, serving them all from a single piece of hardware would spawn concurrency nightmares that could choke our database to death.
Though there are other approaches to our scaling challenges, there are further motivating factors that make sharding a very compelling option. First, a database server will typically process a small database faster than a large one. Second, by utilizing multiple instances of computing power, we can spread the processing load across those instances so that more work can be executed at once.
How To Shard
Let us go back to our FreeDiskSpace killer app (assuming we now give each user up to 1GB of database space) and practice some very simple sharding to get a partial understanding of what sharding is all about. If we were to design the database on a single instance, the tables would look like this:
Server: ABC, Database: DB1, Table: userfiles

| userfiles_pk | user | uploaded file | country_code |
|---|---|---|---|
| 1 | john | file1 | grmy |
| 2 | john | file2 | grmy |
| 3 | allan | file1 | can |
| 4 | allan | file2 | can |
| 5 | allan | file3 | can |
| 6 | edu | file1 | grmy |
| 7 | roman | file1 | ity |
| 8 | roman | file2 | ity |
| 9 | jon | file1 | can |
| 10 | jon | file2 | can |
| 11 | jon | file3 | can |

Server: ABC, Database: DB1, Table: country

| country_code | country |
|---|---|
| grmy | germany |
| can | canada |
| ity | italy |
If we are to shard, we need to decide how to split our database into logical, manageable fragments, and then decide what data will occupy each fragment (we can call the occupier our “tenant”). Based on the tables above, there are two fields we can potentially use as sharding keys: user and country. These fields should answer the question: how do we sensibly split up our database? Do we split it by user, or by country? Both qualify as a basis for splitting up our database, so let us investigate each.
Sharding by field: User
Server: ABC, Database: DB1, Table: userfiles

| userfiles_pk | user | uploaded file | country_code |
|---|---|---|---|
| 1 | john | file1 | grmy |
| 2 | john | file2 | grmy |

Server: ABC, Database: DB1, Table: country

| country_code | country |
|---|---|
| grmy | germany |
| can | canada |
| ity | italy |

Server: CDE, Database: DB1, Table: userfiles

| userfiles_pk | user | uploaded file | country_code |
|---|---|---|---|
| 3 | allan | file1 | can |
| 4 | allan | file2 | can |
| 5 | allan | file3 | can |

Server: CDE, Database: DB1, Table: country

| country_code | country |
|---|---|
| grmy | germany |
| can | canada |
| ity | italy |
By deciding to make each user a tenant in our database, the following statements would be true:
- Each shard would only hold data related to a single user.
- Various shards can live on the same server; shards don’t need to be on separate servers.
- It would render the country table a bit useless and awkward, but we can always get rid of it and just keep a column indicating a user’s country of origin.
- With a one-user-one-database structure, even just 1 million of our users would mean 1 million separate databases. In SQL Azure’s terms, assuming a 1GB database costs us $10 a month, sharding by user means spending $10 million monthly. Now that is no joke!
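The arithmetic behind that last figure is worth making explicit (illustrative Python; the price is the one assumed above):

```python
# Monthly cost of one-user-one-database sharding at the assumed price point.
users = 1_000_000            # databases needed: one per user
cost_per_db_month = 10       # assumed $10/month for a 1GB SQL Azure database

monthly_cost = users * cost_per_db_month
print(f"${monthly_cost:,} per month")   # $10,000,000 per month
```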
Sharding by field: Country
Server: ABC, Database: DB1, Table: userfiles

| userfiles_pk | user | uploaded file | country_code |
|---|---|---|---|
| 1 | john | file1 | grmy |
| 2 | john | file2 | grmy |
| 6 | edu | file1 | grmy |

Server: ABC, Database: DB1, Table: country

| country_code | country |
|---|---|
| grmy | germany |
| can | canada |
| ity | italy |

Server: CDE, Database: DB1, Table: userfiles

| userfiles_pk | user | uploaded file | country_code |
|---|---|---|---|
| 3 | allan | file1 | can |
| 4 | allan | file2 | can |
| 5 | allan | file3 | can |
| 9 | jon | file1 | can |
| 10 | jon | file2 | can |
| 11 | jon | file3 | can |

Server: CDE, Database: DB1, Table: country

| country_code | country |
|---|---|
| grmy | germany |
| can | canada |
| ity | italy |
By deciding to make each country a tenant in our database, the following statements would be true:
- Each shard would only hold data related to a single country.
- Various shards can live on the same server; shards don’t need to be on separate servers.
- It would render the country table unnecessary, as each database represents a country; we can always get rid of this table.
- Answers.com says there are only 196 countries in the world. With a one-country-one-database structure, if we get the larger 10GB, $100-a-month SQL Azure databases, we only need to spend $19,600 a month.
- This design works well until we run out of space in a country’s 10GB database to hold all of that country’s users. There are solutions to this problem, but I will try to discuss them in future blogs.
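With country as the tenant, routing a request to the right shard becomes a simple lookup. A sketch of the idea, where the server assignments mirror the tables above but the connection strings themselves are invented:

```python
# Country-based shard routing: each country code maps to the database
# holding that country's users. Assignments are hypothetical placeholders.
COUNTRY_SHARDS = {
    "grmy": "Server=ABC;Database=DB1",
    "can":  "Server=CDE;Database=DB1",
    "ity":  "Server=CDE;Database=DB1",
}

def shard_for_country(country_code: str) -> str:
    # Fail loudly for countries that have no shard assigned yet.
    if country_code not in COUNTRY_SHARDS:
        raise ValueError(f"no shard assigned for country {country_code!r}")
    return COUNTRY_SHARDS[country_code]

print(shard_for_country("can"))   # Server=CDE;Database=DB1
```

Note that two country shards can share one server (as "can" and "ity" do here), which matches the point above that shards don’t need to live on separate servers.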
It is pretty clear that picking the sharding key has serious ramifications, so it is very important to determine which key is appropriate for our needs. In this scenario, picking country is the most cost-effective, while picking user would shield us much longer from the 10GB size limit.
This article presents only part of the sharding story, but it should give you the basic idea of how database sharding can be applied and how it all translates when you choose SQL Azure as a platform.
Thoughts on Designing Databases for SQL Azure – Part 2
**************************************************
Toto Gamboa is a consultant specializing in databases, Microsoft SQL Server, and software development, operating in the Philippines. He is currently a member and one of the leaders of the Philippine SQL Server Users Group, a Professional Association for SQL Server (PASS) chapter, and is one of Microsoft’s MVPs for SQL Server in the Philippines. You may reach him by sending an email to totogamboa@gmail.com