My First Shot At Sharding Databases

Sometime in 2004, I was faced with the question of whether our library system’s design, which I started way back in 1992 using Clipper on Novell NetWare and later ported to ASP/MSSQL, should be torn apart in favor of a more advanced, scalable and flexible design. The problem I encountered most often is that academic organizations can have varying structures. Public schools have regional structures where libraries are shared by various schools in the region. In some organizations, a school has sister schools with several campuses, each with one or more libraries, all managed by a single entity. In another setup, a single campus can host several schools, each having one or more libraries. These variations pose a lot of challenges in terms of programming and deployment: a potential design nightmare. Each school or library would often emphasize its independence and uniqueness against other schools and libraries, for example wanting its own library policies and control over its collections, users and customers, and yet still desire to share resources and interoperate with the others. Even within a campus, one library can operate on a different time schedule from another library just a hallway apart. All of that presented quite a challenge in terms of having a sound database design.

The old design, from 1992 to 2004, was a single database with most tables holding references to common tables called “libraries” and “organizations”. That was an attempt to partition content by library or organization (a school). Scaling wasn’t a concern at the time, as even the largest library in the country would probably not consume more than a few gigabytes of disk space. The challenge was that every query inside the system had to filter everything by library or school. As the features of our library system grew in number and became more advanced and complex, it was apparent that the old design, though very simple when contained in a single database, would soon burst into a problem. Coincidentally, I had been contemplating advancing the product in terms of feature set. Flexibility was my number one motivator; second was the possibility of doing it all over the web.
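To illustrate the burden: in the old shared-schema design, every read had to carry the tenant predicate. Here is a minimal sketch of the pattern; the table and column names are hypothetical, not our actual schema:

```sql
-- Old shared-schema design: every query must remember the tenant filter.
DECLARE @LibraryID INT = 42;
DECLARE @SearchPattern NVARCHAR(100) = N'%history%';

SELECT b.Title, b.CallNumber
FROM dbo.Books AS b
WHERE b.LibraryID = @LibraryID        -- omit this predicate anywhere
  AND b.Title LIKE @SearchPattern;    -- and one tenant sees another's data
```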

Then came the ultimate question: should I retain the design and improve on it, or should I be more daring and ambitious? I scoured the Internet for guidance on a sound design and, after a thorough assessment of current and possible future challenges, including scaling, I decided to break things apart and abandon the single-database mindset. The old design went into my garbage bin. Consequently, that was the beginning of my love for sharding databases as a way to address issues of library organization, manageability and control and, to some extent, scalability.

The immediate question was how I was going to do the sharding. Picking up the old schema from the garbage bin, it was pretty obvious that breaking things apart by library was the most logical choice. I hadn’t heard of the concept of a “tenant” then, but I didn’t have to, as the logic behind choosing it is as ancient as it can be. There were other potential candidate keys to shard the database on, like “schools” or “organizations”, but the most logical was the “library”. It is the only entity that can stand drubbing and scrubbing. I went on to design our library system with each database containing only one tenant: the library. As of this writing, our library system runs in various configurations: one school has several libraries inside its campus; another has several campuses scattered all over Metro Manila, some with one or more libraries, but everything sits on a VPN accessing a single server.

Our design, though, is yet to become fully sharded at all levels, as another system acts as a common control for all the databases. This violates the concept of a truly sharded design, where there should be no shared entity among shards. A few tweaks here and there, though, would bring it into full compliance. Our current design is 100% sharded at the library level.
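That common control amounts to a shard map: a small catalog that tells the application which database holds which library. A minimal sketch with hypothetical names (our actual control schema differs):

```sql
-- Control database: maps each tenant (library) to the database
-- holding its shard.
CREATE TABLE dbo.LibraryShardMap (
    LibraryID    INT           NOT NULL PRIMARY KEY,
    LibraryName  NVARCHAR(200) NOT NULL,
    ServerName   SYSNAME       NOT NULL,  -- a single server today, but
    DatabaseName SYSNAME       NOT NULL   -- ready to spread across many
);

-- The application resolves the tenant once, then connects directly
-- to that library's database for everything else.
DECLARE @LibraryID INT = 42;

SELECT ServerName, DatabaseName
FROM dbo.LibraryShardMap
WHERE LibraryID = @LibraryID;
```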

So Why Sharding?

The advent of computing in the cloud presents new opportunities, especially for ISVs. With things like Azure, we will be forced to rethink our design patterns. The most challenging part, perhaps, is designing not only to address concerns of scalability but to make our applications tenant-aware and tenant-ready. This challenge is not only present in the cloud; a lot of on-premise applications can be designed this way too, which could help in everyone’s eventual transition to the cloud. But cloud or not, we can benefit a lot from sharding. In our case, we can support pretty much any configuration out there. We also get to mimic the real-world operation of libraries. And it eases up a lot of things like security and control.

Developing Sharded Applications

Aside from the databases, applications need to be fully aware that they are no longer accessing a single database where they can easily query everything without minding that other data exists somewhere else, possibly in a different database sitting on another server. Though the application becomes a bit more complex in terms of design, it is often easier to comprehend and develop when an app instance minds only a single tenant, as opposed to an app instance trying to filter out other tenants just to get the information set of one tenant.
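The payoff shows in the queries themselves. Once the connection has been routed to the right library’s database (via something like the shard map sketched earlier), the tenant predicate simply disappears; again a sketch with the same hypothetical names:

```sql
-- Sharded design: the connection already points at one library's
-- database, so the query needs no tenant filter at all.
DECLARE @SearchPattern NVARCHAR(100) = N'%history%';

SELECT b.Title, b.CallNumber
FROM dbo.Books AS b
WHERE b.Title LIKE @SearchPattern;
```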

Our Library System

Currently our library system runs in sharded mode both on premise and in cloud-like hosted environments. You might want to try its online search.

Looking at SQL Azure

Sharding isn’t automatic on any Microsoft SQL Server platform, including SQL Azure. One needs to do it by hand and from the ground up. This might change in the future, though; I am quite sure Microsoft will see this as a compelling feature. SQL Azure is the only Azure-based product that currently has no natural/inherent support for scaling out. If I were Microsoft, I would offer a full SQL Server service alongside SQL Azure, like shared Windows hosting sites do, to ease adoption. Our system’s database design readiness (being currently sharded) would allow us to easily embrace such a service. I understand it could dampen their SQL Azure efforts if they did it, but I would rather they reconsider than offer a very anemic product.

As of now, though we would have wanted to take our library system to Azure with a few minor tweaks, we just can’t with this version of SQL Azure, for the reasons stated below:

  • SQL Server Full-Text Search: not supported in this version of SQL Azure.
  • Database backup and restore: not supported in this version of SQL Azure.
  • Reporting Services: not supported in this version of SQL Azure.
  • Cost: $109.95 a month for Azure/SQL Azure versus $10 a month for a shared host with a full-featured IIS7/SQL Server 2008 database.

My Design Paid Off

Now I am quite happy that the potential of a multi-tenant, sharded database design, ancient though it is, is beginning to get attention with the advent of cloud computing. My 2004 database design is definitely laying the groundwork for our cloud computing adoption. Meanwhile, I have to look for solutions to address what’s lacking in SQL Azure. There could be some easy workarounds.

I’ll find time to get into the technical aspects of sharding databases in future blogs. I am also thinking that PHISSUG should hold one of its tech-sessions on sharding.

**************************************************

Speed Matters: Subquery vs Table Variable vs Temporary Table

This has perhaps been written about quite a number of times by a lot of TSQL gurus out there, but earlier I was presented with TSQL code that prompted me to recheck my notions on subqueries, table variables and temporary tables. The TSQL code presented to me is something built dynamically that depends heavily on user inputs. The issue presented was performance. The code used to run acceptably fine until recently, when things turned annoyingly slow.

Upon seeing the code, I said to myself: problem solved. Instantly, I thought I had spotted what was causing the performance issue. The code was written using TSQL’s temporary tables. Based on experience, I would automatically shoot down any attempt to use temporary tables when executing a TSQL batch. And though Microsoft has introduced table variables as a better alternative to temporary tables, better in terms of technicality and programmability, delving into those two options would always be my last resort. In most cases, I’d rather tackle difficult scenarios requiring the handling of intermediate data using subqueries. In fact, I immediately ruled for a rewrite of the code to minimize the use of temporary tables, or at least replace them with table variables, which were supposed to be a lot better. Upon further investigation, I still ruled for a rewrite, though because of other things not related to the topic at hand here. Meanwhile, I am trying to figure out how things can be converted to subqueries, which I would primarily prefer.

In my investigation, though, something interesting came up, and I thought I’d better time the executions of my experiment while I made templates of how things are done using subqueries. The results now intrigue me a bit, and I will probably take a second look at the code I was shown earlier; I might have been looking in the wrong place. Nevertheless, the results of my experiment might interest others out there. So here it is:

The Experiment 

I always love to use one of our company’s products, the library system, to test my experiments with anything to do with TSQL, as its database is diverse in design and test data is available in great abundance. I picked one small sample database for my experiment, and this is how it went:

  • Test 1 is intended to check how the different approaches fare on small sets of data. I have to query all library materials written in ‘Filipino’ (which results in 974 rows out of the 895,701 rows in a single table). From that result, I have to search for the string pattern “bala” in all of the materials’ titles using the LIKE operator, which eventually yields 7 rows.
  • Test 2 is basically the same query against the same environment with different conditions, to see how the three approaches fare on large sets of data. So I queried all library materials written in ‘English’ (which results in 43,896 rows out of the 895,701 rows). From that result, I have to search for the string pattern “life” in all of the materials’ titles using the LIKE operator, which eventually yields 440 rows.

I wrote three queries, each using a specific approach to produce the desired output of this experiment.

The first query uses SUBQUERIES:
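The original listing is not reproduced here, so this is a minimal sketch of its shape, assuming a hypothetical dbo.Materials table with MaterialID, LanguageName and Title columns (Test 1’s conditions shown):

```sql
-- Subquery approach: the language filter lives in a derived table,
-- and the title search runs over it in a single statement.
SELECT m.MaterialID, m.Title
FROM (
    SELECT MaterialID, Title
    FROM dbo.Materials
    WHERE LanguageName = 'Filipino'   -- 974 of 895,701 rows
) AS m
WHERE m.Title LIKE '%bala%';          -- 7 rows out
```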

The second query uses TABLE VARIABLES:
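Again a sketch under the same assumed schema, this time staging the filtered rows in a table variable:

```sql
-- Table variable approach: stage the language-filtered rows first,
-- then search the staged titles in a second statement.
DECLARE @Filtered TABLE (
    MaterialID INT NOT NULL PRIMARY KEY,
    Title      NVARCHAR(400)
);

INSERT INTO @Filtered (MaterialID, Title)
SELECT MaterialID, Title
FROM dbo.Materials
WHERE LanguageName = 'Filipino';

SELECT MaterialID, Title
FROM @Filtered
WHERE Title LIKE '%bala%';
```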

The third and last query uses TEMPORARY TABLES:
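And the same staging done with a temporary table in tempdb, again as a sketch:

```sql
-- Temporary table approach: identical staging, but a #table gets
-- column statistics and can be indexed after creation.
CREATE TABLE #Filtered (
    MaterialID INT NOT NULL PRIMARY KEY,
    Title      NVARCHAR(400)
);

INSERT INTO #Filtered (MaterialID, Title)
SELECT MaterialID, Title
FROM dbo.Materials
WHERE LanguageName = 'Filipino';

SELECT MaterialID, Title
FROM #Filtered
WHERE Title LIKE '%bala%';

DROP TABLE #Filtered;
```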

I ran all three queries (three times each) after a service restart to clean up tempdb, and I got the following results. Figures are in seconds:

          Total Rows   Intermediate Rows   Final Result   Subqueries   Table Variables   Temporary Tables
Test 1    895,701      974                 7              2 secs       4 secs            2 secs
Test 2    895,701      43,896              440            3 secs       23 secs           4 secs

 

The results proved pretty interesting. I had thought, based on the queries I wrote, that temporary tables would be slightly slower of the three approaches. The execution plans for the table variable and the temporary table are pretty much the same. I was a bit surprised that the execution time of the table variable was almost twice that of the temporary table on the small set of data, and that the temporary table trumped the table variable by a large margin on the large set of data. I was expecting only a slight difference since, though the two are a bit different in terms of scoping, technicality and to some extent purpose, both use tempdb for storage (contrary to the notion that table variables are purely in-memory) when handling data. What is notable, though, is that disk I/O is costlier with table variables. I am not sure at this point what causes this cost. I will try to dig deeper when I have more time.
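For anyone reproducing the numbers, SQL Server can report per-statement time and I/O directly; a minimal harness to wrap around any of the three sketches above:

```sql
-- Report elapsed/CPU time and logical/physical reads per statement.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- ... run one of the three queries here ...

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```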

Nevertheless, this particular experiment does not support a general conclusion that any one of the three approaches is better than the others. I still subscribe to the general notions:

  • That subqueries are neat and would perform better in most, if not all, cases.
  • That temporary tables would be slower in environments with heavy concurrency, as SQL Server would handle more locks, especially when they are used within a transaction. I have not tested this scenario here, though.
  • That temporary tables may be a faster solution if constraints/indexes are needed in between queries, especially when a large set of data is expected; statistics on table variable columns aren’t created, and a table variable can’t have indexes added to it (see the sketch after this list). But surprisingly, this experiment shows table variables performing the worst of the three, with the large data set being their slowest case.
  • That table variables are a good option technically/programmatically.
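To make the constraints/indexes point concrete: with a temporary table you can add an index between the staging load and the follow-up query, something a SQL Server 2008 table variable does not allow. A hypothetical sketch:

```sql
-- Stage rows, then index the staged data before querying it again.
CREATE TABLE #Staged (MaterialID INT NOT NULL, Title NVARCHAR(400));

INSERT INTO #Staged (MaterialID, Title)
SELECT MaterialID, Title
FROM dbo.Materials
WHERE LanguageName = 'English';

-- Legal on a #table; CREATE INDEX on a table variable is not.
CREATE INDEX IX_Staged_MaterialID ON #Staged (MaterialID);

SELECT s.MaterialID, s.Title
FROM #Staged AS s
WHERE s.MaterialID BETWEEN 1000 AND 2000;

DROP TABLE #Staged;
```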

You be the judge. It would be prudent to conduct your own tests. I can’t say my experiment here is conclusive, but from here on, I would avoid using table variables whenever I can. For the very simple queries in these tests, they performed poorly. I can’t imagine how things would be with larger data sets and more complex queries.

I am on to my next experiment!

**************************************************
Toto Gamboa is a consultant specializing in databases, Microsoft SQL Server and software development, operating in the Philippines. He is currently a member and one of the leaders of the Philippine SQL Server Users Group, a Professional Association for SQL Server (PASS) chapter, and is one of Microsoft’s MVPs for SQL Server in the Philippines. You may reach him by sending an email to totogamboa@gmail.com