In the first article of this series, I raised an issue considered to be one of sharding’s oddities: what does one do when a single tenant occupying a shard exceeds that shard’s capability (e.g. in terms of storage and computing power)? The scenario I was referring to in the first article was my choice of “country” as the way of defining a tenant (the sharding key). In this iteration, I’ll once again attempt to share my thoughts on how I would approach the situation.
Off the bat, I’d probably blurt out the following when asked how to solve this issue:
- Increase the size of the shard
- Increase the computing power of the machine where the shard is situated
For on-premise sharding implementations, throwing in more hardware is easy enough to accomplish. However, doing either of the above when you are using SQL Azure is easier said than done. Here is why:
- Microsoft limits SQL Azure’s database sizes to 1GB, 5GB, and 50GB chunks.
- The computing power of the instance where a shard can reside in SQL Azure is just as finite
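Within those caps, though, a shard can at least be measured and resized before it hits the wall. A minimal sketch, assuming the sys.dm_db_partition_stats DMV and the ALTER DATABASE … MODIFY syntax that SQL Azure supports (exact editions and limits may change over time):

```sql
-- Run against the shard database: approximate space reserved, in MB.
-- reserved_page_count is counted in 8 KB pages.
SELECT SUM(reserved_page_count) * 8.0 / 1024 AS reserved_mb
FROM sys.dm_db_partition_stats;

-- Run while connected to master: grow the shard, but only up to the
-- largest size SQL Azure offers (50 GB on the Business edition).
ALTER DATABASE DB1 MODIFY (EDITION = 'business', MAXSIZE = 50GB);
```

Beyond 50GB, there is simply no bigger knob to turn.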
I have heard unverified reports that Microsoft, on a case-to-case basis, allows an SQL Azure database to grow beyond 50GB and will perhaps situate a shard on some fine special rig. This, however, raises the question of how far Microsoft would let each and every SQL Azure subscriber avail of such special treatment. It could also cost one a fortune to get things done this way.
However, there are various ways to circumvent the issue at hand without getting special treatment. One can do the following:
- You can tell your tenant not to grow too big and consume too much computing power (Hey … Flickr does this. :P)
- You can shard a shard. Things can get really complicated, but any time of the day, one can chop a shard into pieces. Besides, at this point, you have probably eaten sharding for breakfast, lunch and dinner.
So how does one shard a shard?
In the first part of this series, I used sharding a database by country as an example. To refresh, here is an excerpt from the first article:
Server: ABC
Database: DB1
Table: userfiles

| userfiles_pk | user | uploaded file | country_code |
| --- | --- | --- | --- |
| 1 | john | file1 | grmy |
| 2 | john | file2 | grmy |
| 6 | edu | file1 | grmy |

Server: ABC
Database: DB1
Table: country

| country_code | country |
| --- | --- |
| grmy | germany |
| can | canada |
| ity | italy |

Server: CDE
Database: DB1
Table: userfiles

| userfiles_pk | user | uploaded file | country_code |
| --- | --- | --- | --- |
| 3 | allan | file1 | can |
| 4 | allan | file2 | can |
| 5 | allan | file3 | can |
| 9 | jon | file1 | can |
| 10 | jon | file2 | can |
| 11 | jon | file3 | can |

Server: CDE
Database: DB1
Table: country

| country_code | country |
| --- | --- |
| grmy | germany |
| can | canada |
| ity | italy |
In the sample above, the first shard contains data related only to grmy (germany) and the second shard contains data related only to can (canada). To break a shard further into pieces, one needs to find another candidate key for sharding. If there is none, as in the case of our example, one should create one. We can think of splitting up a country by introducing regions from within it (e.g. split by provinces, by cities, or by states). In this example, we can pick city as our sharding key. To illustrate how, see the following shards:
Shard #1
Server: ABC1
Database: DB1
Table: userfiles

| userfiles_pk | user | uploaded file | country_code | city_code |
| --- | --- | --- | --- | --- |
| 1 | john | file1 | grmy | berlin |
| 2 | john | file2 | grmy | berlin |

Shard #2
Server: ABC2
Database: DB1
Table: userfiles

| userfiles_pk | user | uploaded file | country_code | city_code |
| --- | --- | --- | --- | --- |
| 6 | edu | file1 | grmy | hamburg |
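One way to picture the actual split: the rows of the original grmy shard are carved out by the new key, copied to their destination shard, and then removed from the source. A minimal sketch in T-SQL, assuming the userfiles table from the example and an external migration process that holds connections to both shards (SQL Azure has no cross-server queries, so the copy itself happens in application code):

```sql
-- On the source (old grmy) shard: the rows that belong to the new 'berlin' shard.
-- The migration process inserts this result set into userfiles on server ABC1.
SELECT userfiles_pk, [user], [uploaded file], country_code, city_code
FROM userfiles
WHERE country_code = 'grmy'
  AND city_code = 'berlin';

-- Once the copy has been verified on the new shard, drop the moved rows here.
DELETE FROM userfiles
WHERE country_code = 'grmy'
  AND city_code = 'berlin';
```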
By deciding to further subdivide a country by city, where each city becomes a shard, the following statements would be true:
- The new sharding key is now city_code.
- Each shard would contain only data related to a single city.
- Various shards can be on the same server. Shards don’t need to be on separate servers (see the shard map sketch after this list).
- The increase in the number of shards would also increase the amount we spend on renting SQL Azure databases. According to Wikipedia, Germany alone has 2,062 cities, so this is some serious monthly spending. However, this example is just for illustration purposes, to convey the idea of sharding. One can always pick or create the most practical and cost-effective key for further sharding, addressing the issue of going beyond a shard’s capacity without the spending overhead of a poor design choice.
- At a certain point in the future, we might exceed a shard’s capacity once again, breaking our design.
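To keep the application pointed at the right shard as the keys multiply, the usual companion to this design is a small shard map that resolves a sharding key to a server and database. The table and lookup below are purely illustrative (the shardmap table and its columns are hypothetical, not something defined in the earlier articles):

```sql
-- A hypothetical shard directory, kept in a central, non-sharded database.
CREATE TABLE shardmap
(
    country_code  VARCHAR(10)  NOT NULL,
    city_code     VARCHAR(50)  NOT NULL,
    server_name   VARCHAR(128) NOT NULL,  -- e.g. 'ABC1'
    database_name VARCHAR(128) NOT NULL,  -- e.g. 'DB1'
    CONSTRAINT pk_shardmap PRIMARY KEY (country_code, city_code)
);

-- The two shards from the example above.
INSERT INTO shardmap VALUES ('grmy', 'berlin',  'ABC1', 'DB1');
INSERT INTO shardmap VALUES ('grmy', 'hamburg', 'ABC2', 'DB1');

-- The application resolves a tenant's shard before opening a connection.
SELECT server_name, database_name
FROM shardmap
WHERE country_code = 'grmy'
  AND city_code = 'berlin';
```

The application then builds its connection string from the returned server and database, so splitting a shard again later only means moving rows and updating the map.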
**************************************************
Toto Gamboa is a consultant specializing in databases, Microsoft SQL Server and software development, operating in the Philippines. He is currently a member and one of the leaders of the Philippine SQL Server Users Group, a Professional Association for SQL Server (PASS) chapter, and is one of Microsoft’s MVPs for SQL Server in the Philippines. You may reach him by sending an email to totogamboa@gmail.com