Parent category: Database technology

Document based databases

How should we use document based databases?


MarkLogic, MongoDB, and Couchbase are document databases.

They're often very fast.

Unfortunately, compared to relational databases, they often do not support efficient joins.




Your idea of synchronizing a SQL database with a document store is similar to my thought of synchronizing a SQL database with DynamoDB, which is a fast key-value store.

I want the best of NoSQL performance combined with the power of SQL joins.

    : Mindey
    :  -- 
    :  -- 


I've designed a keyspace for JSON that is fast to decode back into JSON and fast to scan with a RocksDB key-value range scan.

This lets us do a regular hash join as a relational database does.
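As an illustration of what that enables, here is a minimal sketch of a classic build-and-probe hash join in Python. The function and the sample data are hypothetical, not the project's actual code:

```python
def hash_join(left, right, key):
    """Join two lists of dicts on a shared key using a hash table."""
    # Build phase: index one side by the join key.
    table = {}
    for row in left:
        table.setdefault(row[key], []).append(row)
    # Probe phase: stream the other side and emit merged matches.
    out = []
    for row in right:
        for match in table.get(row[key], []):
            out.append({**match, **row})
    return out

# Hypothetical sample data mirroring the post's examples.
people = [{"name": "Samuel Squire", "employeeCount": 2500}]
hobbies = [{"name": "Samuel Squire", "hobby": "databases"}]
print(hash_join(people, hobbies, "name"))
```

A relational engine does the same thing over table rows; here the rows would come from range scans over the keyspace.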

These JSON documents:

{"name": "Samuel Squire",
 "job": {
  "currentJob": {"company": {"employeeCount": 2500}}
 }
}

{"_id": "1",
 "name": "Samuel Squire",
 "hobbies": [
  {"name": "God"}, {"name": "databases"}, {"name": "multicomputer systems"}
 ]
}

are turned into at least the following key-value objects:

0.0 = "Samuel Squire" = "2500"

0.0 = "Samuel Squire" = "God" = "databases" = "multicomputer systems"

Essentially, this forms a flat structure of the document, keyed by paths.
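As a rough illustration of the flattening idea (this is a generic sketch, not the post's exact keyspace encoding), nested JSON can be turned into path/value pairs like so:

```python
def flatten(doc, prefix=""):
    """Flatten a nested JSON-like structure into (path, value) pairs.
    Lists are flattened by element index. Illustrative only; the post's
    actual keyspace layout differs in detail."""
    pairs = []
    if isinstance(doc, dict):
        for k, v in doc.items():
            pairs += flatten(v, f"{prefix}.{k}" if prefix else k)
    elif isinstance(doc, list):
        for i, v in enumerate(doc):
            pairs += flatten(v, f"{prefix}.{i}")
    else:
        pairs.append((prefix, doc))
    return pairs

doc = {"_id": "1", "name": "Samuel Squire",
       "hobbies": [{"name": "God"}, {"name": "databases"}]}
for path, value in flatten(doc):
    print(path, "=", value)
```

Because the paths sort lexicographically, all keys of one document (or one field across documents, depending on key order) end up adjacent in a sorted key-value store.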

"type people": "object",

"type people.*": "list",

"type people.*.0": "string",

"type people.*.1": "list",

"type people..1..0": "string",

"type people..1.": "object",

"type people.*.2": "object",

"type people.*.2.0": "object",

"type people.*.2.0.0": "object",

"type people.*.": "number",

"field people.*": "LIST",

"field people..1.": "LIST",

"field people.*.0": "name",

"field people.*.1": "hobbies",

"field people..1..0": "name",

"field people.*.2": "job",

"field people.*.2.0": "currentJob",

"field people.*.2.0.0": "company",

"field people.*.": "employeeCount",

"field people": "people",

"field people.*": "LIST",

"field people.*.3": "words",

"field people..3.": "LIST",

"field people..3..*":"LIST",

"field people..3...": "LIST",

"type people.*.3": "list",

"type people..3.": "list",

"type people..3..*": "list",

"type people..3...": "list",

"type people..3....*": "number"

    : Mindey
    :  -- 
    :  -- 



The keyword is "efficient". Efficiency is inversely proportional to computational complexity, and so, I assume, you look for new algorithms for joins with unstructured data.

First, the problem is already solved in SQL databases, right? Why not take a look at the implementation and go from there?

Let's say we have the raw data as records of JSON (or dictionaries, hashmaps). What you're concerned about then is efficient querying, which is a subject of indexing (query optimization, or query algorithms). We routinely index SQL databases into ElasticSearch, because SQL databases are not good or flexible enough at the kinds of text search users care about: we use another data system that is good at it, and keep a copy of the data there. Not very space-saving, but it works. We could do the same with NoSQL: if you need join-like queries, just "index" the data into a SQL database via a specialized process that interprets and migrates the SQL database on the fly, working as a complementary job in concert with the NoSQL store, always looking for new fields and creating those fields in the complementary SQL database. Sure, using many databases at once is not an elegant solution, so I agree that we need to improve document-based databases. After all, schemas are not non-existent: every record implies a schema of some sort, and when sufficiently many records share certain fields, that may justify creating a new SQL field or foreign key. Think of it like a brain that realizes new "laws of physics" once it sees sufficiently many examples of a specific type...
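The "every record implies a schema" idea can be sketched concretely: scan the documents, count how often each field appears, and promote fields shared by enough records to SQL columns. The function, threshold, and sample data below are hypothetical illustrations, not an existing tool:

```python
from collections import Counter

def infer_schema(records, threshold=0.8):
    """Propose SQL columns for fields shared by most records.
    A field is promoted when it appears in at least `threshold`
    of the documents; its Python type name stands in for a SQL type."""
    counts = Counter()
    types = {}
    for rec in records:
        for field, value in rec.items():
            counts[field] += 1
            types[field] = type(value).__name__
    n = len(records)
    return {f: types[f] for f, c in counts.items() if c / n >= threshold}

records = [{"name": "a", "age": 1}, {"name": "b", "age": 2}, {"name": "c"}]
print(infer_schema(records))
```

A real migration job would also have to handle type conflicts between documents and re-run as new records arrive, which is where the "looking for new fields" part comes in.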




How to build a document-based database that supports joins?

A common weakness of document databases is that they often do not support efficient joins. They represent document records as opaque blobs.

I created the attached project to discuss how I plan to implement document-based storage using key-value storage as the backend. Each key of the document is stored as a separate key, and the keys are arranged so that range scans are efficient, allowing hash joins to take place efficiently.
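The property that makes this work is that an ordered key-value store (as RocksDB provides) serves all keys sharing a prefix in one contiguous scan. A hypothetical sketch over a sorted in-memory list, standing in for a RocksDB iterator:

```python
import bisect

# Hypothetical flattened keys; a real store would hold these in RocksDB.
kv = sorted([
    ("people.0.name", "Samuel Squire"),
    ("people.0.hobbies.0.name", "God"),
    ("people.1.name", "Mindey"),
])

def range_scan(kv, prefix):
    """Return all pairs whose key starts with `prefix`.
    Binary-search to the first candidate, then scan forward until the
    prefix no longer matches, like seeking a RocksDB iterator."""
    keys = [k for k, _ in kv]
    out = []
    for k, v in kv[bisect.bisect_left(keys, prefix):]:
        if not k.startswith(prefix):
            break
        out.append((k, v))
    return out

print(range_scan(kv, "people.0."))
```

Feeding two such scans into a hash join gives join-like queries over documents without reading whole opaque blobs.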

    : Bassxn2
    :  -- 
    :  --