Skip to content

Lecture 10 | Indexing

:material-circle-edit-outline: 约 810 个字 :material-clock-time-two-outline: 预计阅读时间 3 分钟

Chapter 14: Indexing

Basic Concepts

Indexing mechanisms used to speed up access to desired data.

例如图书馆找书,根据书名首字母,按字母表排序找

  • Search Key - attribute to set of attributes used to look up records in a file

An index file consists of records (called index entries) of the form:

image-20240430090729914

Index files are typically much smaller than the original file

  • Two basic kinds of indices:
    • Ordered indices: search keys are stored in sorted order
      • 就像按首字母顺序找书
    • Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.
      • 哈希函数转化,根据search key计算出地址
      • 但是很难找到均匀的哈希函数
      • 所以还是用Ordered indice更多

Ordered Indices

Primary index& Secondary index

In an ordered index, index entries are stored sorted on the search key value.

  • Primary index(主索引): in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
    • Also called clustering index(聚集索引)
    • The search key of a primary index is usually but not necessarily the primary key

即,源文件中的record已经按照某attribute的顺序排下来,这个attribute上的index就是primary index

search key次序与物理文件排放次序一样

最高效的一种

对一个物理文件,基本上最多只有一个主索引

  • Secondary index(辅助索引): an index whose search key specifies an order different from the sequential order of the file.
    • Also called non-clustering index.

search key次序与物理文件排放次序不一样

  • Index-sequential file(索引顺序文件): ordered sequential file with a primary index.

image-20240430092137123

image-20240430092209214

注意不是直接40000->record里的40000,而是中间多一个容器

这些容器包含所有属性值相同的record的地址,比如40000指向的容器包含所有属性值为40000的record地址

因为一个40000可能对应多个record

Dense Index Files & Sparse Index Files

  • Dense index(稠密索引) — Index record appears for every search-key value in the file.

每个record都有个索引项

image-20240430092848810

  • Sparse Index(稀疏索引): contains index records for only some search-key values.

查找方法类似B+树找leaf

image-20240430093026505

稀疏索引只能建立在主索引上

辅助索引很难更新

Multilevel Index(多级索引)

If primary index does not fit in memory, access becomes expensive.

Solution: treat primary index kept on disk as a sequential file and construct a sparse index on it.

  • outer index – a sparse index of primary index
  • inner index – the primary index file

If even outer index is too large to fit in main memory, yet another level of index can be created, and so on.

Indices at all levels must be updated on insertion or deletion from the file.

不就是B+树嘛

B+-Tree Index Files

B+-tree index files are an alternative to indexed-sequential files.

image-20240430094011240

image-20240430093920902

image-20240430094051823

B+树每一个node占用disk的一个block,4096byte=4KiB】、

B+树的各类操作见ADS笔记,会考

B+- tree : height and size estimation

重要考点,计算空间大小

image-20240430101422188

注意小数是往上取还是往下取

  • 每个record的大小
  • 每个block可以放多少个record
  • n个record需要多少个block
  • fan-out是什么玩意不知道
    • 哦哦,是一个node多少个pointer,即多少阶

image-20240430103134532

注意叶子删掉,不用管内部节点,不然要再删除一次,与ADS讲的不一样

B+-Tree File Organization

之前讲的有堆,顺序,哈希

image-20240430104308076

image-20240430104312831

Other Issues in Indexing

image-20240430104440619

针对second key的难更新问题,我们可以这样

一个B+树 通过second key的属性值找对应的primary key

另一个B+树 通过primary key 找到物理地址

Indexing Strings

第三节课没听240429

ch14 - Indexing(4).pdf

image-20240430105146701

Bulk Loading and Bottom-Up Build

image-20240430105358062

Bottom-Up B+-tree Build

image-20240430110147372

image-20240430110525837

image-20240430111105376

image-20240430111111118

image-20240430111115360

image-20240430111121580

image-20240430111125486

image-20240430111129677

image-20240430111138238

image-20240430111142413