Lecture 9 |

:material-circle-edit-outline: 约 2018 个字 :material-clock-time-two-outline: 预计阅读时间 7 分钟

主要内容：

数据持久存放于以磁盘为代表的存储设备中，处理时需读入主存。磁盘和主存之间存在着巨大的访问速度鸿沟。讲授以块为单位的内外存数据传输、缓冲区管理与替换策略、记录在块中的存放方式，以及数据文件组织的主要形式。

作业12：书后习题 12.1、13.5、13.9、13.11

Chapter 12: Physical Storage Systems

Storage Hierarchy（存储级别）

硬盘和内存之间的交互是以块为单位的，即便只要1bit也会传输一大块内存

tertiary storage also called off-line storage

Magnetic Hard Disk Mechanism

磁盘满是机械运动，有大量限制

Read-write head
- Positioned very close to the platter surface (almost touching it)
- Reads or writes magnetically encoded information
Surface of platter divided into circular tracks（磁道）
- Over 50K-100K tracks per platter on typical hard disks
Each track is divided into sectors（扇区）.
- A sector is the smallest unit of data that can be read or written.
- Sector size typically 512 bytes
- Typical sectors per track: 500 to 1000 (on inner tracks) to 1000 to 2000 (on outer tracks)
- 扇区就是一个块
Disk controller(磁盘控制器) – interfaces between the computer system and the disk drive hardware.
- checksums：用于检测是否正确写入扇区，就是一种校验码

Performance Measures of Disks

Access time(访问时间) – the time it takes from when a read or write request is issued to when data transfer begins. Consists of:
- Seek time（寻道时间） – time it takes to reposition the arm over the correct track.
  - Average seek time is 1/2 the worst case seek time.
  - 4 to 10 milliseconds on typical disks
- Rotational latency （旋转延迟）– time it takes for the sector to be accessed to appear under the head.
  - Average latency is 1/2 of the worst case latency.
  - 4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.)
Data-transfer rate（数据传输率） – the rate at which data can be retrieved from or stored to the disk.
- 25 to 100 MB per second max rate, lower for inner tracks Multiple disks may share a controller, so rate that controller can handle is also important 

顺序访问就是找到一个位置后顺着读取就完成了，随机访问就是读取多个不同位置

我们尽量避免后者，且尽量将后者转化为前者

Disk block is a logical unit for storage allocation and retrieval
- 4 to 16 kilobytes typically
  - Smaller blocks: more transfers from disk
  - Larger blocks: more space wasted due to partially filled blocks

Sequential access pattern(顺序访问模式)
- Successive requests are for successive disk blocks
- Disk seek required only for first block
Random access pattern（随机访问模式）
- Successive requests are for blocks that can be anywhere on disk
- Each access requires a seek
- Transfer rates are low since a lot of time is wasted in seeks
I/O operations per second (IOPS ，每秒I/O操作数)
- Number of random block reads that a disk can support per second 50 to 200 IOPS on current generation magnetic disks

还有可靠性，可能有故障

为了应对故障，数据库的复杂度翻了一倍

Mean time to failure (MTTF，平均故障时间) – the average time the disk is expected to run continuously without any failure.
- Typically 3 to 5 years
- Probability of failure of new disks is quite low, corresponding to a “theoretical MTTF” of 500,000 to 1,200,000 hours （57 to 136 years）for a new disk
- An MTTF of 1,200,000 hours for a new disk means that given 1000 relatively new disks, on an average one will fail every 1200 hours(50 days)
- MTTF decreases as disk ages

Optimization of Disk-Block Access

Buffering: in-memory buffer to cache disk blocks
Read-ahead(Prefetch): Read extra blocks from a track in anticipation that they will be requested soon
Disk-arm-scheduling algorithms re-order block requests so that disk arm movement is minimized
- 与controller相关
elevator algorithm

File organization
- Allocate blocks of a file in as contiguous a manner as possible
- Allocation in units of extents(盘区）
- Files may get fragmented
  - E.g., if free blocks on disk are scattered, and newly created file has its blocks scattered over the disk
  - Sequential access to a fragmented file results in increased disk arm movement
  - Some systems have utilities to defragment the file system, in order to speed up file access
Nonvolatile write buffers （非易失性写缓存） – speed up disk writes by writing blocks to a non-volatile RAM buffer immediately
- Non-volatile RAM: battery backed up RAM or flash memory 
  - Even if power fails, the data is safe and will be written to disk when power returns
- Controller then writes to disk whenever the disk has no other requests or request has been pending for some time
- Database operations that require data to be safely stored before continuing can continue without waiting for data to be written to disk
- Writes can be reordered to minimize disk arm movement
Log disk（日志磁盘） – a disk devoted to writing a sequential log of block updates
- Used exactly like nonvolatile RAM
  - Write to log disk is very fast since no seeks are required
  - No need for special hardware (NV-RAM)

Flash Storage

NAND flash - used widely for storage, cheaper than NOR flash, requires page-at-a-time read (page: 512 bytes to 4 KB)

SSD(Solid State Disks) ：Use standard block-oriented disk interfaces, but store data on multiple flash storage devices internally

Solid State Disk可达上万IOPS

Erase happens in units of erase block

Erase block typically 256 KB to 1 MB (128 to 256 pages)

Remapping of logical page addresses to physical page addresses avoids waiting for erase Flash translation table tracks mapping

Remapping carried out by flash translation layer

wear leveling(磨损均衡)- evenly distributed erase operators across physical blocks

Storage Class Memory(NVM)

。。。。。。

Chapter 13: Data Storage Structures

数据库不是直接去管理上面说的硬盘，数据库引擎下面是OS，OS下面才是硬件

File Organization

The database is stored as a collection of files.

Each file is a sequence of records.

A record is a sequence of fields.

Fixed-Length Records

Store record i starting from byte n * (i – 1), where n is the size of each record.

Record access is simple but records may cross blocks

Modification: do not allow records to cross block boundaries

Deletion of record i:

move records i + 1, . . ., n toi, . . . , n – 1
move record n to i
do not move records, but link all free records on a free list

Variable-Length Records

Variable length attributes represented by fixed size (offset, length), with actual data stored after all fixed length attributes

Null values represented by null-value bitmap（空位图）

下图中，前面三个部分是length-fixed的，第一个数字定义position，第二个数字定义length

Slotted Page Structure

Slotted page（分槽页） header contains:

number of record entries
end of free space in the block
location and size of each record

Records can be moved around within a page to keep them contiguous with no empty space between them; entry in the header must be updated.

Record pointers should not point directly to record — instead they should point to the entry for the record in header.

就如果在操作后records里面出现free space，就将移动record填充这个free space，相当于将free space都挤到前面去，集中在一起。

同时要更新record 的 entry

Organization of Records in Files

Heap – record can be placed anywhere in the file where there is space
Sequential – store records in sequential order, based on the value of the search key of each record
In a multitable clustering file organization records of several different relations can be stored in the same file
- Motivation: store related records on the same block to minimize I/O
B+-tree file organization ordered storage even with inserts/deletes
Hashing – a hash function computed on search key; the result specifies in which block of the file the record should be placed

Heap File Organization

Records can be placed anywhere in the file where there is free space. Records usually do not move once allocated

Important to be able to efficiently find free space within file

Free-space map

Array with 1 entry per block.

Each entry is a few bits to a byte, and records fraction of block that is free

example

3 bits per block, value divided by 8 indicates

each entry stores maximum free space

Free space map written to disk periodically, OK to have wrong (old) values for some entries (will be detected and fixed)

Sequential File Organization

The records in the file are ordered by a search-key

Need to reorganize the file from time to time to restore sequential order

就定期重组一次

Multitable Clustering File Organization

Store several relations in one file

Can add pointer chains to link records of a particular relation

Table Partitioning

Records in a relation can be partitioned into smaller relations that are stored separately

E.g., transactionrelation may be partitioned into trans_2018, trans_2019, etc

Reduces costs of some operations such as free space management
Allows different partitions to be stored on different storage devices

Data Dictionary Storage

The Data dictionary (also called system catalog) stores metadata

metadata:

Information about relations
User and accounting information, including passwords
Statistical and descriptive data
- number of tuples in each relation
Physical file organization information
Information about indices

Relational Representation of System Metadata

Storage Access

一个块从硬盘读取后会暂时放在内存，确切说是放在缓冲区，以便短时间内的下次使用

程序会先通过buffer manager看内存里有没有，没有就先在buffer留个空间，然后从硬盘取出块放到buffer里

满了就腾出最远的未使用的块，LRU policy

We can reduce the number of disk accesses by keeping as many blocks as possible in main memory.

Buffer – portion of main memory available to store copies of disk blocks.
Buffer manager – subsystem responsible for allocating buffer space in main memory.

Buffer Manager in DBMS

LRU Example

LRU policy

Buffer Manager

Programs call on the buffer manager when they need a block from disk.

If the block is already in the buffer, buffer manager returns the address of the block in main memory
If the block is not in the buffer, the buffer manager
1. Allocates space in the buffer for the block 1. Replacing (throwing out) some other block, if required, to make space for the new block. 2. Replaced block written back to disk only if it was modified since the most recent time that it was written to/fetched from the disk.
2. Reads the block from the disk to the buffer, and returns the address of the block in main memory to requester