As we move towards wide adoption of NoSql databases, we often run into the need to consolidate our data on the shiny new NoSql platforms. Tons (several GB in our case) of data needs to move from SQL to NoSql. This presents an annoying but seeming simple problem. However the devil is in the details.
- Sql DBs have a mature set of tools for bulk import/export but not many tools exist for doing the same across SQL to NoSQL.
- Each app/NoSQL db combination usually has its own unique data model.
- The object/document/key-value data structures do not map to relation tables in a straightforward manner
- Salability semantics are different. SQL DBs will usually choke on large number of parallel operations but are typically faster at executing complicated queries
All this has to be taken into account to craft a solution which is fast, accurate and leverages strength of both the platforms.
After several iterations this is what I settled on and seems to work pretty well.
I leverage three pieces of infrastructure
- SQL DB cluster
- NoSQL DB cluster
- Grid computing infrastructure (Usually accompanies a NoSQL setup).
Keep load on the SQL DB as low as possible
Simple SQL: Ideally a vanilla select without any logic. Add an index to support the query. This allows the SQL DB to serve requests very quick. If you have to manipulate, do that on the grid job. Trade off SQL DB load for grid node load.
Leverage grid computing
Split the overall import into sub-tasks which can be farmed out to the grid. I typically break down by a range of Primary Keys. Example, for importing a million rows, i would split into 100 jobs importing 10,000 rows each. Each of these would read from the SQL db (using the sql described above) and write out to the NoSQL db. You could leverage transaction semantics if they make sense in your domain. Large number of concurrent writes is usually not a problem for NoSql DBs. So this is likely to be constrained by the number of these tasks the SQL db can concurrently support.
Other things to consider
If you have replication going on in either systems, its best to turn it off while the data migration happens. You can turn in on afterwards and let it catch up.
Actively monitor the load on the DBs, especially if you have a large number of grid nodes. 100’s or 1000’s of nodes can overwhelm DBs quicker than you would think.
Do a few trial runs with smaller datasets with detailed logging, Once you have that down switch to Error level logging only. I/O is expensive.