In the first article of this series, I raised an issue considered to be one of sharding’s oddities: what does one do when a single tenant occupying a shard exceeds that shard’s capability (e.g. in terms of storage and computing power)? The scenario I was referring to in the first article was my choice of “country” as the way of defining a tenant (the sharding key). In this iteration, I’ll once again attempt to share my thoughts on how I would approach the situation.
Off the bat, I’d probably blurt out the following when asked how to solve this issue:
- Increase the size of the shard
- Increase the computing power of the machine where the shard is situated
For on-premise sharding implementations, throwing in more hardware is easy enough to accomplish. However, doing either of the above when you are using SQL Azure is easier said than done. Here is why:
- Microsoft limits SQL Azure’s database sizes to 1GB, 5GB, and 50GB chunks.
- The computing power of the instance where a shard can reside in SQL Azure is just as finite
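Within those caps, though, a shard can at least be measured and resized before it hits the wall. A minimal sketch, assuming the sys.dm_db_partition_stats DMV and the ALTER DATABASE … MODIFY syntax that SQL Azure supports (exact editions and limits may change over time):

```sql
-- Run against the shard database: approximate space reserved, in MB.
-- reserved_page_count is counted in 8 KB pages.
SELECT SUM(reserved_page_count) * 8.0 / 1024 AS reserved_mb
FROM sys.dm_db_partition_stats;

-- Run while connected to master: grow the shard, but only up to the
-- largest size SQL Azure offers (50 GB on the Business edition).
ALTER DATABASE DB1 MODIFY (EDITION = 'business', MAXSIZE = 50GB);
```

Beyond 50GB, there is simply no bigger knob to turn.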
I have heard unverified reports that Microsoft, on a case-to-case basis, allows an SQL Azure database to grow beyond 50GB and will perhaps situate a shard on some fine special rig. This, however, raises the question of how far Microsoft would let each and every SQL Azure subscriber avail of such special treatment. It could also cost one a fortune to get things done this way.
However, there are various ways to circumvent the issue at hand without getting special treatment. One can do the following:
- You can tell your tenant not to grow too big and consume too much computing power (Hey … Flickr does this. :P)
- You can shard a shard. Things can get really complicated, but any time of the day, one can chop a shard into pieces. Besides, at this point, you have probably eaten sharding for breakfast, lunch and dinner.
So how does one shard a shard?
In the first part of this series, I used sharding a database by country as an example. To refresh, here is an excerpt from the first article:
Server: ABC
Database: DB1
Table: userfiles

| userfiles_pk | user | uploaded file | country_code |
| --- | --- | --- | --- |
| 1 | john | file1 | grmy |
| 2 | john | file2 | grmy |
| 6 | edu | file1 | grmy |

Server: ABC
Database: DB1
Table: country

| country_code | country |
| --- | --- |
| grmy | germany |
| can | canada |
| ity | italy |

Server: CDE
Database: DB1
Table: userfiles

| userfiles_pk | user | uploaded file | country_code |
| --- | --- | --- | --- |
| 3 | allan | file1 | can |
| 4 | allan | file2 | can |
| 5 | allan | file3 | can |
| 9 | jon | file1 | can |
| 10 | jon | file2 | can |
| 11 | jon | file3 | can |

Server: CDE
Database: DB1
Table: country

| country_code | country |
| --- | --- |
| grmy | germany |
| can | canada |
| ity | italy |
In the sample above, the first shard contains data related only to grmy (germany) and the second shard contains data related only to can (canada). To break a shard further into pieces, one needs to find another candidate key for sharding. If there is none, as in the case of our example, one should create one. We can think of splitting up a country by introducing regions from within it (e.g. split by provinces, by cities, or by states). In this example, we can pick city as our sharding key. To illustrate how, see the following shards:
Shard #1
Server: ABC1
Database: DB1
Table: userfiles

| userfiles_pk | user | uploaded file | country_code | city_code |
| --- | --- | --- | --- | --- |
| 1 | john | file1 | grmy | berlin |
| 2 | john | file2 | grmy | berlin |

Shard #2
Server: ABC2
Database: DB1
Table: userfiles

| userfiles_pk | user | uploaded file | country_code | city_code |
| --- | --- | --- | --- | --- |
| 6 | edu | file1 | grmy | hamburg |
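One way to picture the actual split: the rows of the original grmy shard are carved out by the new key, copied to their destination shard, and then removed from the source. A minimal sketch in T-SQL, assuming the userfiles table from the example and an external migration process that holds connections to both shards (SQL Azure has no cross-server queries, so the copy itself happens in application code):

```sql
-- On the source (old grmy) shard: the rows that belong to the new 'berlin' shard.
-- The migration process inserts this result set into userfiles on server ABC1.
SELECT userfiles_pk, [user], [uploaded file], country_code, city_code
FROM userfiles
WHERE country_code = 'grmy'
  AND city_code = 'berlin';

-- Once the copy has been verified on the new shard, drop the moved rows here.
DELETE FROM userfiles
WHERE country_code = 'grmy'
  AND city_code = 'berlin';
```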
By deciding to further subdivide a country by city, where each city becomes a shard, the following statements would be true:
- The new sharding key is now city_code.
- Each shard would contain only data related to a single city.
- Various shards can be on the same server. Shards don’t need to be on separate servers (see the shard map sketch after this list).
- The increase in the number of shards would also increase the amount we spend on renting SQL Azure databases. According to Wikipedia, Germany alone has 2,062 cities, so this is some serious monthly spending. However, this example is just for illustration purposes, to convey the idea of sharding. One can always pick or create the most practical and cost-effective key for further sharding, addressing the issue of going beyond a shard’s capacity without the spending overhead of a poor design choice.
- At a certain point in the future, we might exceed a shard’s capacity once again, breaking our design.
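To keep the application pointed at the right shard as the keys multiply, the usual companion to this design is a small shard map that resolves a sharding key to a server and database. The table and lookup below are purely illustrative (the shardmap table and its columns are hypothetical, not something defined in the earlier articles):

```sql
-- A hypothetical shard directory, kept in a central, non-sharded database.
CREATE TABLE shardmap
(
    country_code  VARCHAR(10)  NOT NULL,
    city_code     VARCHAR(50)  NOT NULL,
    server_name   VARCHAR(128) NOT NULL,  -- e.g. 'ABC1'
    database_name VARCHAR(128) NOT NULL,  -- e.g. 'DB1'
    CONSTRAINT pk_shardmap PRIMARY KEY (country_code, city_code)
);

-- The two shards from the example above.
INSERT INTO shardmap VALUES ('grmy', 'berlin',  'ABC1', 'DB1');
INSERT INTO shardmap VALUES ('grmy', 'hamburg', 'ABC2', 'DB1');

-- The application resolves a tenant's shard before opening a connection.
SELECT server_name, database_name
FROM shardmap
WHERE country_code = 'grmy'
  AND city_code = 'berlin';
```

The application then builds its connection string from the returned server and database, so splitting a shard again later only means moving rows and updating the map.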
**************************************************
Toto Gamboa is a consultant specializing in databases, Microsoft SQL Server and software development, operating in the Philippines. He is currently a member and one of the leaders of the Philippine SQL Server Users Group, a Professional Association for SQL Server (PASS) chapter, and is one of Microsoft’s MVPs for SQL Server in the Philippines. You may reach him by sending an email to totogamboa@gmail.com