Thoughts on Designing Databases for SQL Azure

Cloud has been the new darling in the world of computing and Microsoft has provided once again developers a new compeling platform to build their new generation applications on. In this picture enters SQL Azure, a cloud-based edition of Microsoft’s leading database product, Microsoft SQL Server. One would ask how one designs an RDBMS database for the cloud and specifically for the Azure platform. It is basically easy to create a simple database in SQL Azure but database creation and database design are two different life forms. So we ask, how are databases be in the cloud and in SQL Azure? In this blog, I hope to provide an answer.

Scale Out

First and foremost, the cloud has been one of the primary solutions to consider when an application is in the realm of a massive scalability issue. Or to put it in another perspective, the Cloud was both a catalyst of the emergence of extremely large applications and a result of the webification of a lot of things. Design considerations need to factor in unheard of scales that can span from almost everywhere. Bandwidth needed is of the global scale, hard disk space are in the petabytes, and users are in the millions. When you talk about building an application to handle this scale, one can’t do anymore the usual stuff any one of us used to do. One now needs to think massively in terms of design. E.g. No single piece of hardware can handle the kind of scale I just mentioned. So how does one design a simple database that is intended to scale up to the degree that I just mentioned?

The Problem

Supposing, we have a great idea of developing the next big thing and we named it FreeDiskSpace. FreeDiskSpace allows us to upload massive amounts of files where files can be anything one can think of. The scale of this next big thing is thought to be in the range of having 500 million users scattered around the globe with each users allowed to upload an unlimited amount of files.

The Challenge

With such order of magnitude, the many challenges we will be facing will definitely lead and force us to question every bit of our knowledge and development capability that we might have been using when we are used to doing gigs in the few GBytes of databases with a few thousand users. If what is allowed makes each user upload files in the range of not less than 5GB, one would need to store 2,500 petabytes worth of files. How does one store 2,500 petabytes worth of files and manage the uploading/downloading of 500 million users?

In the Cloud, you are less likely to see a bunch of supercomputers handling this kind of scale. Rather, you are more likely to see hundreds of thousands of commodity servers woven together forming a cohesive unit. Making use of these mesh to behave like it is as if it is operating as a unit requires design patterns that are more appropriate for these environments.

Database Sharding in SQL Azure

First, what is database sharding. To put it simply, database sharding is a design pattern to split up data into shards or fragments where each shard can be run on its own on a separate instance running on a separate server located somewhere. Each shard contains the same structure as that of another but contains different subsets of data.

Current version of SQL Azure offers NO feature that handles automatic scaling. This will surely change in the near future as everybody doing applications handling these massive loads will somehow need an automatic way to allow the application to handle this amount of load and data easily.  One way that is guaranteed to be commonly used will be sharding databases. However for now, sharding in SQL Azure isn’t automatic. If one needs to shard, one needs to do it by hand. As I have said, in the future, Microsoft will be compelled to give us an easy way to practice sharding. Hopefully, that future is near.

Now we would ask, why do we need to shard databases in SQL Azure. The cloud and specifically with SQL Azure, with its limitations, forces us to use this design pattern. SQL Azure only allows us 50GB for a database’s maximum size on a virtual instance running on an off the  shelf commodity server. With that, we surely need a way to work around these constraints if we see our data to grow for like 500 terabytes. Second, if we foresee a huge number of users hitting our applications, doing this on a single piece of hardware would spawn concurrency nightmares that would choke our databases to death.

Though there are other approaches to solve our scaling challenges, there are other interesting motivating factors where sharding becomes a very compelling option. Firstly, it is more likely for a database server to process small databases faster than larger ones. Secondly, by utilizing multiple instances of computing power, we can spread processing load across these instances to more processes can be executed all at once.

How To Shard

Let us try to go back with our FreeDiskSpace killer app (assuming we provide each user up 1GB in database space) and practice some very simple sharding so we have a partial understanding of what sharding is all about. If we are to design the database using a single instance, the table would look like this:

Server: ABC
Database: DB1
Table: userfiles
userfiles_pk user uploaded file country_code
1 john file1 grmy
2 john file2 grmy
3 allan file1 can
4 allan file2 can
5 allan file3 can
6 edu file1 grmy
7 roman file1 ity
8 roman file2 ity
9 jon file1 can
10 jon file2 can
11 jon file3 can
Server: ABC
Database: DB1
Table: country
country_code country
grmy germany
can canada
ity italy

If we are to shard, we need to decide on how we split up our database into logical, manageable fragments. We then decide what data is going to occupy each fragment (we can probably call the occupier our “tenant”). Based on tables we see above, there are 2 fields that we can potentially use as keys for sharding. These fields are user and country. These fields should provide answers to a question: How to we sensibly split up our database? Do we have to split it up by user, or by country? Both are actually qualified to be used as basis for splitting up our database. We can investigate each though.

Sharding by field: User

Server: ABC
Database: DB1
Table: userfiles
userfiles_pk user uploaded file country_code
1 john file1 grmy
2 john file2 grmy
Server: ABC
Database: DB1
Table: country
country_code country
grmy germany
can canada
ity italy
Server: CDE
Database: DB1
Table: userfiles
userfiles_pk user uploaded file country_code
3 allan file1 can
4 allan file2 can
5 allan file3 can
Server: CDE
Database: DB1
Table: country
country_code country
grmy germany
can canada
ity italy

By deciding to make each user as a tenant in our database, the following statements would be true:

  • Our shard would only occupy data related to a single user.
  • Various shards can be in the same server. Shards don’t need to be in separate servers.
  • It would render country table a bit useless and awkward. But we can always decide to get rid of it and just maintain a column indicating a user’s country of origin.
  • Now with a one-user-one-database structure, 1 million users mean 1 million separate databases. In SQL Azure’s terms, assuming a 1GB costs us 10$ a month, designing our shards by user means we have to spend $10 million monthly. Now that is no joke!

Sharding by field: Country

Server: ABC
Database: DB1
Table: userfiles
userfiles_pk user uploaded file country_code
1 john file1 grmy
2 john file2 grmy
6 edu file1 grmy
Server: ABC
Database: DB1
Table: country
country_code country
grmy germany
can canada
ity italy
Server: CDE
Database: DB1
Table: userfiles
userfiles_pk user uploaded file country_code
3 allan file1 can
4 allan file2 can
5 allan file3 can
9 jon file1 can
10 jon file2 can
11 jon file3 can
Server: CDE
Database: DB1
Table: country
country_code country
grmy germany
can canada
ity italy

By deciding to make country as a tenant for each of our database, the following statements would be true:

  • Our shard would only occupy data related to a country.
  • Various shards can be in the same server. Shards don’t need to be in separate servers.
  • It would render country table unnecessary as each database represent a country. But we can always decide to get rid of this table.
  • Asnwer.com says there are only 196 countries in the world. With a one-country-one-database structure,  this means,  if we are getting the larger 10GB, 100$ a month SQL Azure databases, we only need to spend $19,600 a month.
  • This design decision though is pretty useful until we ran out of space in our 10GB design to contain all users related to a country. There are solutions to this problem though. However, I will try to discuss this in my future blogs.

It is pretty clear that picking the right key for sharding has obvious ramifications. It is very important for us to determine which ones are appropriate for our needs. It is obvious that picking country is the most cost-effective in this scenario while picking user could shield you much longer on the 10GB limit problem.

This article only presents partial concepts of sharding but nevertheless has given one the very basic idea of how database sharding can be applied, and how all these translate when you choose SQL Azure as a platform.

Thoughts on Designing Databases for SQL Azure – Part 2

**************************************************
Toto Gamboa is a consultant specializing on databases, Microsoft SQL Server and software development operating in the Philippines. He is currently a member and one of the leaders of Philippine SQL Server Users Group, a Professional Association for SQL Server (PASS) chapter and is one of Microsoft’s MVP for SQL Server in the Philippines. You may reach him by sending an email to totogamboa@gmail.com

Advertisement

Toto Pictures : Surfin Urbiztundo

I shot these photos last November 29, 2008 during the La Union Grommet Championship 2008 in Urbiztundo, San Juan, La Union. I am actually not a sports photographer but when I get to shoot them, I found that they are very excellent subjects. I enjoyed shooting them all day long when I was there.

It has been some time now and I really want to photograph them again. Meanwhile, here are a few shots of surfing young kids that are 16 and below (groms) that participated in the tournament. Attached are the settings I use for these shots.


Nikon D300, Sigma 150-500mm, Ballhead + Tripod
Shot at 500mm, f7.1, 1/500 secs, ISO-200


Nikon D300, Sigma 150-500mm, Ballhead + Tripod
Shot at 500mm, f7.1, 1/500 secs, ISO-200


Nikon D300, Sigma 150-500mm, Ballhead + Tripod
Shot at 500mm, f8, 1/1250 secs, ISO-200

I am hoping to go back one of these days and shoot these surfers again. This time, as a Canon user 🙂

Going Down The Decision Making Lane: Azure vs Hosted vs On-premise

Cloud computing has brought new possibilities and challenges to ISVs like my company. It has also brought along a lot of seemingly difficult questions that has no direct answers. However, for anyone looking at the possibility of investing its future on new technologies, a lot of questions needed some answers.

In this blog, I have come up with a list of concerns, cases, and questions that would usually be essential when choosing the right decision. To make things a lot clearer, I would probably use a real world example based from my own circumstances to weigh things down between competing platforms that are all worth exploring.

My Case

In my last cloud computing related blog (http://totogamboa.com/2010/12/11/my-first-shot-at-sharding-databases/), I mentioned one of my company’s products that is being considered going beyond the usual realm of on-premise applications. In this blog, I continue to explore new possibilities for this application as we enter a new era of cloud computing.

The obvious question for our company now is where to go from here as far as our library system is concerned.

My Options

I have with me several platforms to choose. All is begging to be explored further, all is wanting attention, to be leveraged, used and abused. Since our library system is heavily glued to Microsoft technologies, my shortlist would naturally explore the several platforms where Microsoft technologies claim market and mindshare dominance.

  1. On-Premise. Microsoft boast a powerful array of on-premise technologies that anyone can take advantage of and have their products comfortably sit on. I’d say that I would be looking up for a typical setup of Windows Server, Internet Information Services/ASP.NET, and SQL Server sitting in our client’s datacenter.
  2. Hosted. Microsoft also has proven strongly that their technologies can be set to serve as hosts over the Internet. This can be proven by the large number of providers of hosting sites that offer Microsoft technologies.  Hosting providers typically offer the same set of technologies one can setup on-premise. Typical setup would be Windows Server, Internet Information Services/ASP.NET, and SQL Server hosted in the provider’s datacenters.
  3. Azure. The relatively new kid on the block is the stack of Microsoft technologies designed mainly as services over the Internet. The stack is composed of Azure technologies and the entirety is touted to be the platform of the future. The new setup includes Windows Azure, a web role type supporting IIS/ASP.NET, and SQL Azure. All these are managed by Microsoft and situated in their humungous datacenters scattered around the globe.

My Possibilities

Picking a choice isn’t always easy, especially when matters discussing the right software solution for an individual’s case. But I would like to attempt to see where I would go from here as far as our library system go. With today’s options and the complexity that comes along with each at our disposal, things, like getting a sound technology decisions, are harder to come by than before.

In this exercise, I prepared important points that need to be clarified to come up with the right, if not better solution. Do take note though that I am not generalizing here by just focusing on technologies that Microsoft provides out there. Remember I am working on answers based only on my circumstances and published specifications pulled from the web.

Here is a compilation of my personal list of concerns in trying to go through a decision making process that would shape up the future of our library system.

Concerns Azure (As of Dec 2010) Hosted On-Premise
What hardware do I have to provide? front side only front side only front and end sides
Is redundancy possible? Yes Yes Yes
Is scaling up possible? Up to 8 CPU (8 x 1.6GHz) only16GB RAM Yes by special arrangement Best money can buy
Is scaling out possible? Yes Yes Yes
Is scaling down possible? Yes No No1
Is Full Text Search available? No Yes Yes
Is database backup & restore for SQL Server available? No. Only possible via SSIS. Yes Yes
Need to rewrite library system? Yes No No
How do I scale out? SQL Azure / sharding SQL Server / sharding SQL Server / sharding / horizontal partitioning
Administration/operation of infrastructure Done by Microsoft Done by provider Done by client/us
Visual Studio Support Yes Yes Yes
ASP.NET Webforms or MVC Support Yes Yes Yes
SQL Server Support SQL Azure SQL Server SQL Server
SQL Server Degree of Capability Subset of SQL Server only Full feature Full feature
Platform Portability If I write on Azure, it will run only on Azure, unless there is a small scale on-premise Azure that can be deployed in the future. If I write on Hosted Environment, it can run also On-Premise If I write On-Premise, it can also run Hosted

1 There is wastage of hardware resources if system is not designed to scale down. Hardware stays when capacity is scaled down. It only remains underutilized.

Summing It Up

At this point in time, the best option out there for my company is to remain in Hosted environments, as seen in information provided in the table of concerns above. All of my requirements needed from SQL Server are available. It is simply that there are still the missing pieces in SQL Azure that prevents us from fully adopting it. The company should also start distancing itself from on-premise implementations. This doesn’t mean though that both Azure and on-premise options are no good. Azure is shaping up to be a formidable platform of choice, especially for projects that are yet to be developed. Soon, and I am quite sure fulltext search capability will be supported in SQL Azure. When that happens, I will have to go through this same exercise all over again.

SQL Azure

In the end, as an database person, it has always been the issue of which RDBMSes may provide for me in various platforms available out there. For now, it is still the non-cloud version of SQL Server that seals the deal for me. SQL Azure isn’t far behind. It is just a fulltext search away! 🙂

**************************************************
Toto Gamboa is a consultant specializing on databases, Microsoft SQL Server and software development operating in the Philippines. He is currently a member and one of the leaders of Philippine SQL Server Users Group, a Professional Association for SQL Server (PASS) chapter and is one of Microsoft’s MVP for SQL Server in the Philippines. You may reach him by sending an email to totogamboa@gmail.com