Newton, MA (PressExposure) June 23, 2011 -- For a long time I planned to explain the ScaleBase Database Load Balancer sharding concept using MapReduce, and a recent post on UnbreakableCloud triggered me to do it. The post, titled "SQL MapReduce on Massively Parallel Processing Database Engines", explains how SQL engines leverage the MapReduce approach to process data intensive queries over shared nothing nodes, each with its own storage.
Well - that's what ScaleBase does. So let's dig deeper into what MapReduce is. According to Wikipedia, "MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers." There are two steps:
"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
Greenplum and AsterData are used as examples in the unbreakable cloud article as they are using MapReduce for their MPP (Massive Parallel Processing) database engines. But ScaleBase is a great example too. Basically, we run Map-Reduce on the SQL commands and not an SQL engine; but since SQL commands (especially under heavy load) can be complex and take a lot of time, doing Map-Reduce on them can be very helpful.
In a standard architecture with ScaleBase, there are several database servers with the entire application data split across them. Those database servers are standard RDBMS engines such as MySQL or Oracle. Each database engine has its own storage and data and is shared-nothing with the other database engines.
There are times when the ScaleBase Database Load Balancer engine directs an incoming SQL command to a single database node. That of course is not a Map-Reduce operation. However, at other times, the ScaleBase Database Load Balancer engine decides to run the command in parallel on several (or all) backend database servers, and then it aggregates the results arriving from the database servers into one meaningful result, which is then returned to the client.
In this case, the ScaleBase Database Load Balancer engine is performing the Map step. The "smaller sub-programs" are the SQL commands issued to the databases. In some cases, it's the same command, at other times it's a manipulated command. Think of the command select count(*) from employees - if employees is a split table, then the command runs on all databases, and the ScaleBase Database Load Balancer just needs to sum the counts. If a select average (salary) from employees is executed, then the load balancer will change the command to select sum (salary), count (salary) from employees and calculate the weighted average on the Database Load Balancer level.
The worker processes performing the bits of the work in parallel are the standard share-nothing databases, and the results aggregation done by the ScaleBase Database Load Balancer engine is the Reduce step.
To summarize, sharding splits a big data problem into a set of small problems. In such a scenario, the ScaleBase Database Load Balancer is the master node, and tries to optimize the process by giving the worker nodes as much work as possible.