What is different from Titan?
There are a few points where S2Graph differs from Titan.
1. Designed to use user-provided IDs as vertex IDs.
Titan assigns an internal long ID to each vertex to guarantee ID uniqueness.
As a result, it is not efficient to fetch an edge/vertex by an ID that is a primary key in another storage system such as an RDBMS: in Titan, the user has to look up the vertex through an index first, then go to the actual vertex.
With S2Graph, users can access edges/vertices by their primary keys directly.
S2Graph is designed to serve graph data alongside data living in other systems.
In our case, we had a user table whose primary key was user_id.
We wanted to fetch all friends' user_ids (not Titan vertex IDs), since the other systems already use user_id as the primary key.
In Titan:
1. Get the internal vertex ID by looking up the index on user_id.
2. Get adjacent edges using the internal vertex ID from step 1.
3. Get the vertex property "user_id" on the vertex of each edge from step 2.
Of course, it is possible to store the friend's user_id as an edge property to remove step 3, but we found this unintuitive, and if a user forgets to add this property at label creation time, it is hard to rebuild.
In S2Graph:
1. Get the ID describing what this vertex ID represents (in this case "user_id") from MySQL. S2Graph calls this a `ServiceColumn`.
2. Build a single Get, composing the `ServiceColumn` ID of "user_id" (from step 1) and the actual vertex ID 1 into the row key.
3. Once step 2 is finished, each edge's target vertex ID is the "user_id" value of a friend (since the "friends" label is defined as a "user_id"-to-"user_id" relation).
S2Graph uses a local cache for step 1 (default TTL: 1 minute) to increase performance, so a single HBase Get operation retrieves all the data needed for our use case (friends' user_ids).
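To make the row-key idea concrete, here is a minimal sketch, assuming a hypothetical `ServiceColumn` class and a simplified layout (S2Graph's actual encoding differs in detail): composing the `ServiceColumn` ID with the user-provided ID yields a row key that addresses the vertex by its primary key, with no index lookup.

```scala
// Illustrative sketch only: S2Graph's real row-key layout is more involved.
// The point is that (serviceColumn id, user-provided id) together form the
// HBase row key, so a vertex is addressable by its primary key directly.
import java.nio.ByteBuffer

case class ServiceColumn(id: Int, columnName: String) // e.g. ServiceColumn(1, "user_id")

def vertexRowKey(column: ServiceColumn, userProvidedId: Long): Array[Byte] =
  ByteBuffer.allocate(4 + 8)
    .putInt(column.id)       // which kind of ID this is (looked up from MySQL, cached locally)
    .putLong(userProvidedId) // the caller's own primary key, e.g. user_id = 1
    .array()

// A single HBase Get on this row key returns the vertex and its edges;
// no index lookup translating user_id into an internal vertex ID is needed.
```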
Conceptually, S2Graph takes a different approach to making vertex IDs unique: instead of converting to a unique internal long ID, it uses the user-provided ID together with metadata describing what that ID represents. S2Graph abstracts this as `service` and `serviceColumn`.
Check out The Data Model for more details.
Here is the comparison in a picture.
2. Update/Delete performance for a specific edge.
Deleting/updating a single edge in Titan is O(K), while S2Graph requires O(log K), where K is the number of adjacent edges.
Performance-wise, it is bad to iterate over all adjacent edges only to find the edge ID to update/delete, and we couldn't find a way to avoid this in Titan.
In Titan:
```groovy
// find the internal vertex for user_id = 1, then search its "friends" edges
startV = g.V().has("user_id", 1).next()
edge = startV.query().labels("friends").direction(OUT).has("user_id", 101).limit(1).edges()
```
The query above follows these steps:
1. Fetch the index on "user_id" to find the internal vertex ID of the source vertex.
2. Iterate over all outgoing edges from that vertex with label "friends" and keep only the edge that has 101 as its "user_id" property value.
Note that this requires O(K), where K is the number of adjacent edges.
In S2Graph, update/delete requires O(log K).
To understand how S2Graph achieves this in detail, take a look at the ApacheCon slides (pages 37-45).
S2Graph uses a `SnapshotEdge` structure that stores the most recent state between any given vertex pair. Using this extra data, finding the edge to delete/update becomes O(log K).
1. Fetch the `SnapshotEdge` between user_id 1 and user_id 101.
2. Build the `IndexedEdge`s to delete/insert with regard to the requested edge (the delete/update operation the user issued).
3. Mutate the `IndexedEdge`s according to the updates from step 2.
In summary, S2Graph stores extra data called a `SnapshotEdge`, exactly one of which exists between any given vertex pair and which holds the most recent state of the edge between that pair. Fetching a `SnapshotEdge` is O(log K) since we can specify the qualifier (the target vertex ID) on the Get request. Based on the fetched `SnapshotEdge`, the internal mutation process backtracks to the ordered edges (which we call `IndexedEdge`s) by creating multiple mutations for HBase, and each mutation takes O(1).
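To illustrate why fetching a `SnapshotEdge` is cheap, here is a hedged sketch against the plain HBase client (not S2Graph's actual code; the table and column-family names are assumptions). Because the target vertex ID is encoded in the column qualifier, one Get pins down the single snapshot cell instead of scanning all adjacent edges:

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Get}
import org.apache.hadoop.hbase.util.Bytes

// Sketch: locate the single snapshot cell for the pair (srcId -> tgtId).
def fetchSnapshotEdge(conn: Connection, srcId: Long, tgtId: Long): Option[Array[Byte]] = {
  val table = conn.getTable(TableName.valueOf("s2graph")) // illustrative table name
  try {
    val get = new Get(Bytes.toBytes(srcId))                 // row key: source vertex ID
    get.addColumn(Bytes.toBytes("s"), Bytes.toBytes(tgtId)) // qualifier: target vertex ID
    Option(table.get(get).getValue(Bytes.toBytes("s"), Bytes.toBytes(tgtId)))
  } finally table.close()
}
```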
3. Multiple indices on a single label.
Assume that we have a `friends` label, but these `friends` edges need to be sorted by multiple different indices (e.g. "created_at" and "updated_at").
In Titan, each label can have only one index, so the user may create multiple labels such as `friends_by_created_at` and `friends_by_updated_at`.
The user has to keep track of which labels need to be mutated when someone adds another user as a friend, and has to keep the multiple related labels consistent. If any partial failure happens while mutating the related labels, the user has to roll back and retry.
By contrast, in S2Graph the `friends` label can have two indices, the first on "created_at" and the second on "updated_at". When a user adds a new edge on the `friends` label, S2Graph makes sure every related index is updated correctly; any partial failure is retried asynchronously, up to 10 times by default.
S2Graph automatically keeps the multiple indices of each label strongly consistent with its own asynchronous retry logic (default: 10 retries) built on HBase's CheckAndSet operation. If the retries are exhausted, S2Graph produces the event to a failure queue (currently only Kafka is supported), so users can handle unresolved contention however they want.
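A minimal sketch of the CheckAndSet retry idea, assuming illustrative column names; `applyPendingUpdates` and `sendToFailureQueue` are hypothetical stubs standing in for S2Graph's state merge and Kafka producer, and the real implementation is asynchronous rather than blocking:

```scala
import org.apache.hadoop.hbase.client.{Get, Put, Table}
import org.apache.hadoop.hbase.util.Bytes

val family    = Bytes.toBytes("s")        // illustrative column family
val qualifier = Bytes.toBytes("snapshot") // illustrative qualifier of the snapshot cell

def applyPendingUpdates(current: Array[Byte]): Array[Byte] = current // hypothetical merge step
def sendToFailureQueue(row: Array[Byte]): Unit = ()                  // hypothetical Kafka producer

@annotation.tailrec
def casWithRetry(table: Table, row: Array[Byte], attempt: Int = 0, maxRetries: Int = 10): Boolean = {
  val expected = table.get(new Get(row)).getValue(family, qualifier) // read the current snapshot
  val updated  = applyPendingUpdates(expected)                       // build the new state
  val put      = new Put(row).addColumn(family, qualifier, updated)
  // commit only if the snapshot is still what we read (null means "cell absent")
  if (table.checkAndPut(row, family, qualifier, expected, put)) true
  else if (attempt + 1 >= maxRetries) { sendToFailureQueue(row); false } // exhausted: hand off
  else casWithRetry(table, row, attempt + 1, maxRetries)               // contention: retry
}
```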
4. Idempotency
S2Graph validates each user request with the request's timestamp (used as a version).
This means that when contention occurs on the same edge (from, to, label, direction), S2Graph is smart enough to drop/apply requests as the user intends, even when there is no guarantee on the order of request delivery.
```
1418950524723 insert e 31111 101 graph_test {"time": 10, "weight": 20}
1418950524722 delete e 31111 101 graph_test
1418950524721 insert e 31111 101 graph_test {"weight": 10}
```
Above is an example where contention occurs and the requests are delivered in reverse order. The logically expected state is a single edge with {"time": 10, "weight": 20}.
Regardless of the delivery order, S2Graph yields the same state after it resolves the contention internally.
This is done by storing every state on the `SnapshotEdge` along with its timestamp.
Firing the same requests always yields the same state, which is why we call this idempotency (if you fire the above requests again, the first request raises a `DuplicateTimestamp` exception and the second and third requests raise `NoUpdateToApplyException`).
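The resolution rule itself fits in a few lines of plain Scala. This is illustrative only, since the real resolver handles more operation types and merges properties by per-property timestamp, but it shows why arrival order does not matter: operations are replayed in timestamp order.

```scala
sealed trait Op { def ts: Long }
case class Insert(ts: Long, props: Map[String, Long]) extends Op
case class Delete(ts: Long) extends Op

// Replay operations in timestamp order; the final state is order-independent.
def resolve(ops: Seq[Op]): Option[Map[String, Long]] =
  ops.sortBy(_.ts).foldLeft(Option.empty[Map[String, Long]]) {
    case (_, Insert(_, props)) => Some(props) // edge (re)created with these props
    case (_, Delete(_))        => None        // edge removed
  }

// The three requests above, in any permutation, resolve to the expected state:
// Some(Map("time" -> 10, "weight" -> 20)).
resolve(Seq(
  Insert(1418950524723L, Map("time" -> 10L, "weight" -> 20L)),
  Delete(1418950524722L),
  Insert(1418950524721L, Map("weight" -> 10L))))
```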
Idempotency becomes very handy when it is hard to guarantee the delivery order of client requests (usually in distributed asynchronous processing, such as with a queue). Also, thanks to idempotency, online migration from a running system to S2Graph can be very simple.
5. Native support for the total number of edges in O(1).
When an insert/delete/update happens on an edge, S2Graph internally maintains a degree value (the count of adjacent edges).
S2Graph uses the Asynchbase client, which buffers increment requests and merges them before actually firing the increment RPC to HBase. This reduces the number of RPCs to HBase when many adjacent edges of the same source vertex are mutated in a short period of time.
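For a feel of the buffering, here is a small sketch using Asynchbase's `bufferAtomicIncrement`; the table/family names and the empty qualifier for the degree cell are assumptions:

```scala
import org.hbase.async.{AtomicIncrementRequest, HBaseClient}

val client = new HBaseClient("localhost") // ZooKeeper quorum (assumption)

def incrementDegree(srcVertexRowKey: Array[Byte]): Unit = {
  val incr = new AtomicIncrementRequest(
    "s2graph".getBytes, srcVertexRowKey, "e".getBytes, Array.empty[Byte], 1L)
  // Buffered: increments on the same cell are coalesced client-side, so N edge
  // mutations on one vertex can collapse into a single increment RPC to HBase.
  client.bufferAtomicIncrement(incr)
}
```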
Counting all outgoing/incoming edges of a given source vertex is O(1), since the degree count is stored in the very first cell of the row.
In Titan, you have to iterate over the adjacent edges to count them for a given source vertex.
6. Use of HBase's bulk load functionality for data import.
In production, it is important to be able to bulk load batch-processed data into a live cluster.
In our case, we have more than 100 batch jobs that compute similar users/items through matrix factorization (check out MLlib) for 20 services.
At first, S2Graph simply used Put with the WAL skipped.
Once the data to import grew larger and more varied (in our case, more than 10 billion edges every day), the Put requests going through the region servers ate up resources needed for read requests.
We typically observed our read latency double while firing Put requests for data import.
We tried throttling (1 million Puts/sec across 40 HBase region servers), but that is only a workaround, not a fundamental solution.
So the current version of S2Graph provides a Spark job that builds HFiles for the HBase bulk load process.
By using HBase bulk load, S2Graph can load a large dataset into a running production cluster without any performance penalty.
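A rough sketch of that pattern in Spark, mirroring the general HFile recipe rather than S2Graph's actual job (names are illustrative; production jobs also call `HFileOutputFormat2.configureIncrementalLoad` to match the table's regions):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.spark.rdd.RDD

// HFiles must be written in row-key order.
implicit val keyOrdering: Ordering[ImmutableBytesWritable] =
  new Ordering[ImmutableBytesWritable] {
    def compare(a: ImmutableBytesWritable, b: ImmutableBytesWritable): Int = a.compareTo(b)
  }

def writeHFiles(edges: RDD[(ImmutableBytesWritable, KeyValue)], output: String): Unit =
  edges
    .sortByKey()
    .saveAsNewAPIHadoopFile(output, classOf[ImmutableBytesWritable],
      classOf[KeyValue], classOf[HFileOutputFormat2], HBaseConfiguration.create())

// Afterwards, HBase's `completebulkload` tool (LoadIncrementalHFiles) moves the
// HFiles under `output` into the live table without going through the Put path.
```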
7. Fully asynchronous API.
All operations that S2Graph provides, including writing vertices/edges and querying vertices/edges, are asynchronous.
Check out the last slide (page 49) from HBaseCon: we compared the native blocking client against Asynchbase, and Asynchbase gave a 3.5x performance gain on reads.
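For flavor, a minimal non-blocking read in the Asynchbase style (table name and the trivial callback are illustrative): the get returns a `Deferred` immediately, and the callback fires when the RPC completes, so the calling thread never blocks.

```scala
import java.util.ArrayList
import com.stumbleupon.async.{Callback, Deferred}
import org.hbase.async.{GetRequest, HBaseClient, KeyValue}

val client = new HBaseClient("localhost") // ZooKeeper quorum (assumption)

// Returns immediately; the callback runs when HBase answers.
def countCells(row: Array[Byte]): Deferred[Int] =
  client.get(new GetRequest("s2graph".getBytes, row))
    .addCallback(new Callback[Int, ArrayList[KeyValue]] {
      def call(kvs: ArrayList[KeyValue]): Int = kvs.size // process the result asynchronously
    })
```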
8. Multitenancy support.
S2Graph is designed as a multitenant system. Users can use their own HBase cluster and table by providing `cluster` and `hTableName` when they create their `service` or `label`.