In many cases, the first step toward using S2Graph in production is migrating a large existing dataset into it. S2Graph provides a bulk loading script for importing this initial dataset.
To use bulk load, you need a running Spark cluster and a TSV file that follows the S2Graph bulk load format.
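As a rough illustration of what a bulk load line looks like, the sketch below builds one tab-separated edge record. The field order (timestamp, operation, element type, source id, target id, label, JSON properties) reflects the common S2Graph bulk format, but the label name `friends` and the properties are hypothetical; check your S2Graph version's documentation for the exact format.

```python
import json

# A hypothetical helper that emits one edge line in the bulk load TSV format.
# Field order assumed: timestamp, operation, element type ("e" for edge),
# source vertex id, target vertex id, label name, JSON-encoded properties.
def bulk_edge_line(ts, src, tgt, label, props):
    return "\t".join(
        [str(ts), "insertBulk", "e", str(src), str(tgt), label, json.dumps(props)]
    )

line = bulk_edge_line(1416236400000, 101, 202, "friends", {"score": 10})
print(line)
```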
Note that if you don't need additional properties on vertices (i.e., you only need vertex IDs), you only need to publish the edges, not the vertices. Publishing edges creates vertices with empty properties by default.
Labels with the strong consistencyLevel have a useful property: operations on them are idempotent.
This property enables online migration from a running database (e.g., a production RDBMS) into S2Graph without missing any events, as long as every event that goes to the RDBMS also goes to S2Graph.
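To see why idempotency matters here, consider a simplified model of a timestamp-resolved edge store. This is a sketch of the behavior, not S2Graph's actual storage engine: applying the same event twice leaves the state unchanged.

```python
# Simplified model of an idempotent, timestamp-resolved edge store.
# This mimics the observable behavior of a strong consistencyLevel label;
# it is an illustrative sketch, not the real S2Graph implementation.
class EdgeStore:
    def __init__(self):
        self.edges = {}  # (src, tgt, label) -> (timestamp, props)

    def apply(self, ts, src, tgt, label, props):
        key = (src, tgt, label)
        current = self.edges.get(key)
        # Last-write-wins by timestamp: duplicates and older events change nothing.
        if current is None or ts > current[0]:
            self.edges[key] = (ts, props)

store = EdgeStore()
event = (1000, "u1", "u2", "friends", {"score": 1})
store.apply(*event)
snapshot = dict(store.edges)
store.apply(*event)             # applying the exact same event a second time...
assert store.edges == snapshot  # ...leaves the store unchanged (idempotent)
```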
The following explains how to run an online migration from an RDBMS to S2Graph. It assumes that the client sends the same events to both the primary storage (RDBMS) and S2Graph.
- Set the label's isAsync option to "true". This redirects all operations to a Kafka queue.
- Dump your RDBMS data to a TSV for bulk loading.
- Load the TSV file with subscriber.GraphSubscriber.
- Set the label's isAsync option back to "false". This stops queuing events into Kafka and applies changes to S2Graph directly.
- Run subscriber.GraphSubscriberStreaming to replay the queued events. Because operations in S2Graph are idempotent, it is safe to replay queued messages while the bulk load is still in progress.
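The dump step above can be sketched as follows. This is a hypothetical example: the `friendship` table, its columns, and the `friends` label are made up for illustration, and an in-memory SQLite database stands in for the production RDBMS. The resulting lines would be written to a file and fed to subscriber.GraphSubscriber via Spark.

```python
import json
import sqlite3

# Stand-in for the production RDBMS (hypothetical table and columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE friendship (src INTEGER, tgt INTEGER, score INTEGER, updated_at INTEGER)"
)
conn.execute("INSERT INTO friendship VALUES (101, 202, 10, 1416236400000)")

# Dump each row as one bulk load TSV line
# (assumed field order: timestamp, operation, "e", src, tgt, label, JSON props).
lines = []
for src, tgt, score, ts in conn.execute(
    "SELECT src, tgt, score, updated_at FROM friendship"
):
    props = json.dumps({"score": score})
    lines.append("\t".join([str(ts), "insertBulk", "e", str(src), str(tgt), "friends", props]))

print("\n".join(lines))
```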
The following is an overview of the online migration process.
Start ETL, and toggle the flag for flushing
Note that once "isAsync" is changed on the labels being migrated, the S2Graph REST server starts queuing events into Kafka rather than flushing them into HBase.
After ETL, replay
After the ETL process finishes, toggle "isAsync" back so that the S2Graph REST server flushes into HBase right away, then start replaying the queued events in Kafka. This is where idempotency matters:
because operations are idempotent, it is safe to replay old events while current events are being flushed into HBase.
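The replay-safety claim can be illustrated with the same simplified timestamp-resolution model (a sketch, not the real S2Graph engine): an older queued event replayed after a newer live write does not clobber the newer state.

```python
# Sketch: replaying an old queued event after a newer live write is harmless
# when conflicts are resolved by timestamp. Simplified model for illustration only.
state = {}  # edge key -> (timestamp, props)

def apply(ts, key, props):
    # Keep only the write with the highest timestamp for each edge.
    if key not in state or ts > state[key][0]:
        state[key] = (ts, props)

apply(2000, ("u1", "u2"), {"score": 5})  # live event, flushed directly to HBase
apply(1000, ("u1", "u2"), {"score": 1})  # older event, replayed from the Kafka queue
assert state[("u1", "u2")] == (2000, {"score": 5})  # the newer state survives replay
```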