Thoughts on Designing Databases for SQL Azure

Cloud has been the new darling in the world of computing and Microsoft has provided once again developers a new compeling platform to build their new generation applications on. In this picture enters SQL Azure, a cloud-based edition of Microsoft’s leading database product, Microsoft SQL Server. One would ask how one designs an RDBMS database for the cloud and specifically for the Azure platform. It is basically easy to create a simple database in SQL Azure but database creation and database design are two different life forms. So we ask, how are databases be in the cloud and in SQL Azure? In this blog, I hope to provide an answer.

Scale Out

First and foremost, the cloud has been one of the primary solutions to consider when an application is in the realm of a massive scalability issue. Or to put it in another perspective, the Cloud was both a catalyst of the emergence of extremely large applications and a result of the webification of a lot of things. Design considerations need to factor in unheard of scales that can span from almost everywhere. Bandwidth needed is of the global scale, hard disk space are in the petabytes, and users are in the millions. When you talk about building an application to handle this scale, one can’t do anymore the usual stuff any one of us used to do. One now needs to think massively in terms of design. E.g. No single piece of hardware can handle the kind of scale I just mentioned. So how does one design a simple database that is intended to scale up to the degree that I just mentioned?

The Problem

Supposing, we have a great idea of developing the next big thing and we named it FreeDiskSpace. FreeDiskSpace allows us to upload massive amounts of files where files can be anything one can think of. The scale of this next big thing is thought to be in the range of having 500 million users scattered around the globe with each users allowed to upload an unlimited amount of files.

The Challenge

With such order of magnitude, the many challenges we will be facing will definitely lead and force us to question every bit of our knowledge and development capability that we might have been using when we are used to doing gigs in the few GBytes of databases with a few thousand users. If what is allowed makes each user upload files in the range of not less than 5GB, one would need to store 2,500 petabytes worth of files. How does one store 2,500 petabytes worth of files and manage the uploading/downloading of 500 million users?

In the Cloud, you are less likely to see a bunch of supercomputers handling this kind of scale. Rather, you are more likely to see hundreds of thousands of commodity servers woven together forming a cohesive unit. Making use of these mesh to behave like it is as if it is operating as a unit requires design patterns that are more appropriate for these environments.

Database Sharding in SQL Azure

First, what is database sharding. To put it simply, database sharding is a design pattern to split up data into shards or fragments where each shard can be run on its own on a separate instance running on a separate server located somewhere. Each shard contains the same structure as that of another but contains different subsets of data.

Current version of SQL Azure offers NO feature that handles automatic scaling. This will surely change in the near future as everybody doing applications handling these massive loads will somehow need an automatic way to allow the application to handle this amount of load and data easily.  One way that is guaranteed to be commonly used will be sharding databases. However for now, sharding in SQL Azure isn’t automatic. If one needs to shard, one needs to do it by hand. As I have said, in the future, Microsoft will be compelled to give us an easy way to practice sharding. Hopefully, that future is near.

Now we would ask, why do we need to shard databases in SQL Azure. The cloud and specifically with SQL Azure, with its limitations, forces us to use this design pattern. SQL Azure only allows us 50GB for a database’s maximum size on a virtual instance running on an off the  shelf commodity server. With that, we surely need a way to work around these constraints if we see our data to grow for like 500 terabytes. Second, if we foresee a huge number of users hitting our applications, doing this on a single piece of hardware would spawn concurrency nightmares that would choke our databases to death.

Though there are other approaches to solve our scaling challenges, there are other interesting motivating factors where sharding becomes a very compelling option. Firstly, it is more likely for a database server to process small databases faster than larger ones. Secondly, by utilizing multiple instances of computing power, we can spread processing load across these instances to more processes can be executed all at once.

How To Shard

Let us try to go back with our FreeDiskSpace killer app (assuming we provide each user up 1GB in database space) and practice some very simple sharding so we have a partial understanding of what sharding is all about. If we are to design the database using a single instance, the table would look like this:

Server: ABC
Database: DB1
Table: userfiles
userfiles_pk user uploaded file country_code
1 john file1 grmy
2 john file2 grmy
3 allan file1 can
4 allan file2 can
5 allan file3 can
6 edu file1 grmy
7 roman file1 ity
8 roman file2 ity
9 jon file1 can
10 jon file2 can
11 jon file3 can
Server: ABC
Database: DB1
Table: country
country_code country
grmy germany
can canada
ity italy

If we are to shard, we need to decide on how we split up our database into logical, manageable fragments. We then decide what data is going to occupy each fragment (we can probably call the occupier our “tenant”). Based on tables we see above, there are 2 fields that we can potentially use as keys for sharding. These fields are user and country. These fields should provide answers to a question: How to we sensibly split up our database? Do we have to split it up by user, or by country? Both are actually qualified to be used as basis for splitting up our database. We can investigate each though.

Sharding by field: User

Server: ABC
Database: DB1
Table: userfiles
userfiles_pk user uploaded file country_code
1 john file1 grmy
2 john file2 grmy
Server: ABC
Database: DB1
Table: country
country_code country
grmy germany
can canada
ity italy
Server: CDE
Database: DB1
Table: userfiles
userfiles_pk user uploaded file country_code
3 allan file1 can
4 allan file2 can
5 allan file3 can
Server: CDE
Database: DB1
Table: country
country_code country
grmy germany
can canada
ity italy

By deciding to make each user as a tenant in our database, the following statements would be true:

  • Our shard would only occupy data related to a single user.
  • Various shards can be in the same server. Shards don’t need to be in separate servers.
  • It would render country table a bit useless and awkward. But we can always decide to get rid of it and just maintain a column indicating a user’s country of origin.
  • Now with a one-user-one-database structure, 1 million users mean 1 million separate databases. In SQL Azure’s terms, assuming a 1GB costs us 10$ a month, designing our shards by user means we have to spend $10 million monthly. Now that is no joke!

Sharding by field: Country

Server: ABC
Database: DB1
Table: userfiles
userfiles_pk user uploaded file country_code
1 john file1 grmy
2 john file2 grmy
6 edu file1 grmy
Server: ABC
Database: DB1
Table: country
country_code country
grmy germany
can canada
ity italy
Server: CDE
Database: DB1
Table: userfiles
userfiles_pk user uploaded file country_code
3 allan file1 can
4 allan file2 can
5 allan file3 can
9 jon file1 can
10 jon file2 can
11 jon file3 can
Server: CDE
Database: DB1
Table: country
country_code country
grmy germany
can canada
ity italy

By deciding to make country as a tenant for each of our database, the following statements would be true:

  • Our shard would only occupy data related to a country.
  • Various shards can be in the same server. Shards don’t need to be in separate servers.
  • It would render country table unnecessary as each database represent a country. But we can always decide to get rid of this table.
  • Asnwer.com says there are only 196 countries in the world. With a one-country-one-database structure,  this means,  if we are getting the larger 10GB, 100$ a month SQL Azure databases, we only need to spend $19,600 a month.
  • This design decision though is pretty useful until we ran out of space in our 10GB design to contain all users related to a country. There are solutions to this problem though. However, I will try to discuss this in my future blogs.

It is pretty clear that picking the right key for sharding has obvious ramifications. It is very important for us to determine which ones are appropriate for our needs. It is obvious that picking country is the most cost-effective in this scenario while picking user could shield you much longer on the 10GB limit problem.

This article only presents partial concepts of sharding but nevertheless has given one the very basic idea of how database sharding can be applied, and how all these translate when you choose SQL Azure as a platform.

Thoughts on Designing Databases for SQL Azure – Part 2

**************************************************
Toto Gamboa is a consultant specializing on databases, Microsoft SQL Server and software development operating in the Philippines. He is currently a member and one of the leaders of Philippine SQL Server Users Group, a Professional Association for SQL Server (PASS) chapter and is one of Microsoft’s MVP for SQL Server in the Philippines. You may reach him by sending an email to totogamboa@gmail.com

Advertisement

Going Down The Decision Making Lane: Azure vs Hosted vs On-premise

Cloud computing has brought new possibilities and challenges to ISVs like my company. It has also brought along a lot of seemingly difficult questions that has no direct answers. However, for anyone looking at the possibility of investing its future on new technologies, a lot of questions needed some answers.

In this blog, I have come up with a list of concerns, cases, and questions that would usually be essential when choosing the right decision. To make things a lot clearer, I would probably use a real world example based from my own circumstances to weigh things down between competing platforms that are all worth exploring.

My Case

In my last cloud computing related blog (http://totogamboa.com/2010/12/11/my-first-shot-at-sharding-databases/), I mentioned one of my company’s products that is being considered going beyond the usual realm of on-premise applications. In this blog, I continue to explore new possibilities for this application as we enter a new era of cloud computing.

The obvious question for our company now is where to go from here as far as our library system is concerned.

My Options

I have with me several platforms to choose. All is begging to be explored further, all is wanting attention, to be leveraged, used and abused. Since our library system is heavily glued to Microsoft technologies, my shortlist would naturally explore the several platforms where Microsoft technologies claim market and mindshare dominance.

  1. On-Premise. Microsoft boast a powerful array of on-premise technologies that anyone can take advantage of and have their products comfortably sit on. I’d say that I would be looking up for a typical setup of Windows Server, Internet Information Services/ASP.NET, and SQL Server sitting in our client’s datacenter.
  2. Hosted. Microsoft also has proven strongly that their technologies can be set to serve as hosts over the Internet. This can be proven by the large number of providers of hosting sites that offer Microsoft technologies.  Hosting providers typically offer the same set of technologies one can setup on-premise. Typical setup would be Windows Server, Internet Information Services/ASP.NET, and SQL Server hosted in the provider’s datacenters.
  3. Azure. The relatively new kid on the block is the stack of Microsoft technologies designed mainly as services over the Internet. The stack is composed of Azure technologies and the entirety is touted to be the platform of the future. The new setup includes Windows Azure, a web role type supporting IIS/ASP.NET, and SQL Azure. All these are managed by Microsoft and situated in their humungous datacenters scattered around the globe.

My Possibilities

Picking a choice isn’t always easy, especially when matters discussing the right software solution for an individual’s case. But I would like to attempt to see where I would go from here as far as our library system go. With today’s options and the complexity that comes along with each at our disposal, things, like getting a sound technology decisions, are harder to come by than before.

In this exercise, I prepared important points that need to be clarified to come up with the right, if not better solution. Do take note though that I am not generalizing here by just focusing on technologies that Microsoft provides out there. Remember I am working on answers based only on my circumstances and published specifications pulled from the web.

Here is a compilation of my personal list of concerns in trying to go through a decision making process that would shape up the future of our library system.

Concerns Azure (As of Dec 2010) Hosted On-Premise
What hardware do I have to provide? front side only front side only front and end sides
Is redundancy possible? Yes Yes Yes
Is scaling up possible? Up to 8 CPU (8 x 1.6GHz) only16GB RAM Yes by special arrangement Best money can buy
Is scaling out possible? Yes Yes Yes
Is scaling down possible? Yes No No1
Is Full Text Search available? No Yes Yes
Is database backup & restore for SQL Server available? No. Only possible via SSIS. Yes Yes
Need to rewrite library system? Yes No No
How do I scale out? SQL Azure / sharding SQL Server / sharding SQL Server / sharding / horizontal partitioning
Administration/operation of infrastructure Done by Microsoft Done by provider Done by client/us
Visual Studio Support Yes Yes Yes
ASP.NET Webforms or MVC Support Yes Yes Yes
SQL Server Support SQL Azure SQL Server SQL Server
SQL Server Degree of Capability Subset of SQL Server only Full feature Full feature
Platform Portability If I write on Azure, it will run only on Azure, unless there is a small scale on-premise Azure that can be deployed in the future. If I write on Hosted Environment, it can run also On-Premise If I write On-Premise, it can also run Hosted

1 There is wastage of hardware resources if system is not designed to scale down. Hardware stays when capacity is scaled down. It only remains underutilized.

Summing It Up

At this point in time, the best option out there for my company is to remain in Hosted environments, as seen in information provided in the table of concerns above. All of my requirements needed from SQL Server are available. It is simply that there are still the missing pieces in SQL Azure that prevents us from fully adopting it. The company should also start distancing itself from on-premise implementations. This doesn’t mean though that both Azure and on-premise options are no good. Azure is shaping up to be a formidable platform of choice, especially for projects that are yet to be developed. Soon, and I am quite sure fulltext search capability will be supported in SQL Azure. When that happens, I will have to go through this same exercise all over again.

SQL Azure

In the end, as an database person, it has always been the issue of which RDBMSes may provide for me in various platforms available out there. For now, it is still the non-cloud version of SQL Server that seals the deal for me. SQL Azure isn’t far behind. It is just a fulltext search away! 🙂

**************************************************
Toto Gamboa is a consultant specializing on databases, Microsoft SQL Server and software development operating in the Philippines. He is currently a member and one of the leaders of Philippine SQL Server Users Group, a Professional Association for SQL Server (PASS) chapter and is one of Microsoft’s MVP for SQL Server in the Philippines. You may reach him by sending an email to totogamboa@gmail.com

Rationalizing Cloud Computing

For the past couple of years, people have been hot on cloud computing bandwagon. And as in any case of a birth of a new trend, people tend to get into misconceptions that eventually led them to get burned. Some just plainly embrace something without bothering to find out what is in it for them.

As a co-owner of a very small ISV trying to grapple with trends in computing and making things work for clients and stakeholders, I would always submit to the pressures of looking into the possibility of taking advantage of what a new trend can offer, cloud computing included. And during these times, I have significantly looked into what cloud computing brings to the table and how my company can take advantage of them. I have at least looked into Azure and non-Azure offerings and have attempted to assemble a design for our company’s products that can take advantage of the good stuff that are in the cloud.

Though my company is definitely a cloud computing enthusiast as some efforts are actually being done to take advantage of the cloud, as far as I am concerned, it won’t be for everybody. But recently, there is a rise of people who are investigating their possibilities with the cloud. This is maybe due to the advent of Microsoft’s farm betting Azure or perhaps there are just too many misconceptions peddled here and there. In most cases, I always see some wrong assumptions being tossed into every cloud discussion that I have had. Clients too are asking me about possibilities of being in the cloud based on wrong assumptions fed to them.

But before I get into those misconceptions, I would highlight the traits that made me and my company embrace cloud computing:

  • Cloud is the great equalizer. For very small ISVs like my company, I can see cloud as a leveling of playing field against the big guys, at least on the issue on having to spend upfront on system infrastructure to offer something for my market. For under $100 a month, a small ISV can start big.
  • Reach. My company would be able to increase reach without much overhead.
  • Scalability. The fact that I would be able to scale as the need arises, cloud computing is definitely for companies like mine.
  • Administration. Being a very small ISV, system administration tasks such as deployment, support and maintenance takes too much toll on our internal resources. Having an easily accessible uniform environment like the cloud, it would allow my company to increase the number of clients to attend to without adding too much stress on our internal resource.

There are other niceties that made me embrace cloud computing, but the above items mentioned are the major factors that I believe can only be achieved through the cloud.

As convinced as I am that cloud computing is one area that my company should invest, I am also convinced that cloud computing is definitely not for everybody and definitely not for every situation out there, at least for its current incarnation. I’d probably list down 5 misconceptions that I often encounter. My rationalization of these misconceptions will be based on what I know based on the efforts my company is doing in the cloud. These includes mostly Azure based clouds as well as few non Azure ones.

Top 5 Misconceptions I Often Encounter

  1. When I move my existing application from on-premise to cloud, my app will become scalable. Well, NOT without rewriting your apps to scale-out and be cloud-ready. In the cloud, scaling is not all automatic. For example, If you have an application that is not designed to scale out, it will not scale-out in the cloud. By scaling out, you have to know how to partition your storage in multiple storage instances. And instances can be spread geographically across continents. To scale-out, one has to start from scale-out design patterns, probably from scratch. Lucky you are if your current application is designed to scale-out.
  2. I can easily move current on-premise application to the cloud. Contrary to peddled notion that you can easily move current on-premise application (legacy) to the cloud, it is really based on what you actually expect to benefit from the cloud. Like in Azure for example, it supports moving legacy applications to the cloud by using virtual machines to host legacy apps (e.g. hosting a Win Forms based app). But this approach for me is inefficient unless all you want to do is transfer from on-premise to the cloud to perhaps get rid of system administration issues while ignoring other implications and issues like performance, latency, security, etc. You would have to figure out new issues as simple as backing up your data in the cloud. If your legacy app supports features available in on-premise environments, you might encounter problems running them in the cloud as current generation clouds do not support all features that are mostly staple in the on-premise environments.
  3. I have been designing on-premise apps for years, it is enough skill set to get me to the cloud. It is if you were designing on-premise apps using technologies and patterns similar to that available in the cloud. If not, you have to take a look at the cloud differently. If you want to write apps for the cloud, you have to think cloud. Scaling is probably the one issue that would force you to think cloud. If we are used to design our apps with a single instance mindset, it is about time to think differently and think multiple. Though of course, no one is stopping us to write apps to the cloud that way, but that is not real cloud apps are supposed to. Single instance design won’t scale even if you run it over current generation cloud. In SQL Azure, we have a 10GB database size limit. If we exceed the 10GB ceiling, what would our apps do? This seemingly easy to answer question could probably discourage anyone to embrace cloud.
  4. I just have to design apps like I use to and deploy it in the cloud, and the technologies behind it takes care of the rest. This can’t be. If you were into scale-out designs by now, chances are, you wouldn’t think of writing an app and deploying it over the cloud. Chances are, you might have deployed a cloud app already or you probably didn’t know that your design approaches are very much like those implemented over the cloud. But for most, I doubt that it’s the case. Most on-premise apps are designed not to scale-out but to scale-up. For example, most SQL Server implementations just take advantage of new hardware to scale as added load comes. Intrinsically, Microsoft SQL Server is not designed to automatically load balance, for example, a single query statement. One has to learn how to partition tables into a single view with underlying partitions spread across a federation of servers so SQL Server load balances a query. This is not supported in SQL Azure though. However, even if this is supported in SQL Azure in the future, this is not the scaling wonders the cloud provides, your application still sees only a single instance of a view. With today’s multi-million users accessing just one web application at one point in time, you can only scale as much. With cloud computing, your scaling possibilities are like crazy. You don’t have limits in terms of servers, of CPUs, of storage. Having resources like this would force you to rethink of the way you design your applications. The cloud isn’t about the technologies behind. For me it is one big design consideration. It is about the way you design your applications to take advantage of the unlimited resources found in the cloud.
  5. Any apps can or should be placed in the cloud. No one will stop you if you place anything regardless of how it will end up in the cloud. However, there could be applications that are better off off the cloud. For example, if you have an app with one massive synchronous process that you think has no way of being broken down into several pieces, you might as well stick to scaling-up on premise (unless your cloud offers hybrid services like allowing you to scale-up with a very powerful server). For apps requiring heavy informational security and if you aren’t comfortable with the notion that someone from your cloud provider might get into your highly secured data, you might as well have things on premise. There could be a lot of other applications that may not be practical in placed in the cloud.

It is very important that the cloud is understood clearly before everyone gets excited. While I am embracing it, I am welcoming it with a cautious attitude. Being relatively new, there are just too many considerations to think of, and too many questions that don’t even have clear answers for the time being.

**************************************************
Toto Gamboa is a consultant specializing on databases, Microsoft SQL Server and software development operating in the Philippines. He is currently a member and one of the leaders of Philippine SQL Server Users Group, a Professional Association for SQL Server (PASS) chapter and is one of Microsoft’s MVP for SQL Server in the Philippines. You may reach him by sending an email to totogamboa@gmail.com