Moving data from SQL to NoSQL

As we move towards wide adoption of NoSql databases, we often run into the need to consolidate our data on the shiny new NoSql platforms. Tons (several GB in our case) of data needs to move from SQL to NoSql. This presents an annoying but seeming simple problem. However the devil is in the details.

  1. Sql DBs have a mature set of tools for bulk import/export but not many tools exist for doing the same across SQL to NoSQL.
  2. Each app/NoSQL db combination usually has its own  unique data model.
  3. The object/document/key-value data structures do not map to relation tables in a straightforward manner
  4. Salability semantics are different. SQL DBs will usually choke on large number of parallel operations but are typically faster at executing complicated queries

All this has to be taken into account to  craft a solution which is fast, accurate and leverages strength of both the platforms.

After several iterations this is what I settled on and seems to work pretty well.

I leverage three pieces of infrastructure

  1. SQL DB cluster
  2. NoSQL DB cluster
  3. Grid computing infrastructure (Usually accompanies a NoSQL setup).

Solution overview

Image

Keep load on the SQL DB as low as possible

Simple SQL: Ideally a vanilla select without any logic.  Add an index to support the query. This allows the SQL DB to serve requests very quick. If you have to manipulate, do that on the grid job. Trade off SQL DB load for grid node load.

Leverage grid computing

Split the overall import into sub-tasks which can be farmed out to the grid. I typically break down by a range of Primary Keys. Example, for importing a million rows, i would split into 100 jobs importing 10,000 rows each. Each of these would read from the SQL db (using the sql described above) and write out to the NoSQL db. You could leverage transaction semantics if they make sense in your domain. Large number of concurrent writes is usually not a problem for NoSql DBs. So this is likely to be constrained by the number of these tasks the SQL db can concurrently support.

Other things to consider

If you have replication going on in either systems, its best to turn it off while the data migration happens. You can turn in on afterwards and let it catch up.

Monitoring

Actively monitor the load on the DBs, especially if you have a large number of grid nodes. 100’s or 1000’s of nodes can overwhelm DBs quicker than you would think.

Logging

Do a few trial runs with smaller datasets with detailed logging, Once you have that down switch to Error level logging only. I/O is expensive.

Advertisements
This entry was posted in Tech bits and tagged . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s