Lecture 9 |
主要内容:
数据持久存放于以磁盘为代表的存储设备中,处理时需读入主存。磁盘和主存之间存在着巨大的访问速度鸿沟。讲授以块为单位的内外存数据传输、缓冲区管理与替换策略、记录在块中的存放方式,以及数据文件组织的主要形式。
作业12:书后习题 12.1、13.5、13.9、13.11
Chapter 12: Physical Storage Systems
Storage Hierarchy(存储级别)
硬盘和内存之间的交互是以块为单位的,即便只要1bit也会传输一大块内存
tertiary storage also called off-line storage
Magnetic Hard Disk Mechanism
磁盘满是机械运动,有大量限制
- Read-write head
- Positioned very close to the platter surface (almost touching it)
- Reads or writes magnetically encoded information
- Surface of platter divided into circular tracks(磁道)
- Over 50K-100K tracks per platter on typical hard disks
- Each track is divided into sectors(扇区).
- A sector is the smallest unit of data that can be read or written.
- Sector size typically 512 bytes
- Typical sectors per track: 500 to 1000 (on inner tracks) to 1000 to 2000 (on outer tracks)
- 扇区就是一个块
- Disk controller(磁盘控制器) – interfaces between the computer system and the disk drive hardware.
- checksums:用于检测是否正确写入扇区,就是一种校验码
Performance Measures of Disks
- Access time(访问时间) – the time it takes from when a read or write request is issued to when data transfer begins. Consists of:
- Seek time(寻道时间) – time it takes to reposition the arm over the correct track.
- Average seek time is 1/2 the worst case seek time.
- 4 to 10 milliseconds on typical disks
- Rotational latency (旋转延迟)– time it takes for the sector to be accessed to appear under the head.
- Average latency is 1/2 of the worst case latency.
- 4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.)
- Seek time(寻道时间) – time it takes to reposition the arm over the correct track.
- Data-transfer rate(数据传输率) – the rate at which data can be retrieved from or stored to the disk.
- 25 to 100 MB per second max rate, lower for inner tracks Multiple disks may share a controller, so rate that controller can handle is also important
顺序访问就是找到一个位置后顺着读取就完成了,随机访问就是读取多个不同位置
我们尽量避免后者,且尽量将后者转化为前者
- Disk block is a logical unit for storage allocation and retrieval
- 4 to 16 kilobytes typically
- Smaller blocks: more transfers from disk
- Larger blocks: more space wasted due to partially filled blocks
- 4 to 16 kilobytes typically
- Sequential access pattern(顺序访问模式)
- Successive requests are for successive disk blocks
- Disk seek required only for first block
- Random access pattern(随机访问模式)
- Successive requests are for blocks that can be anywhere on disk
- Each access requires a seek
- Transfer rates are low since a lot of time is wasted in seeks
- I/O operations per second (IOPS ,每秒I/O操作数)
- Number of random block reads that a disk can support per second 50 to 200 IOPS on current generation magnetic disks
还有可靠性,可能有故障
为了应对故障,数据库的复杂度翻了一倍
- Mean time to failure (MTTF,平均故障时间) – the average time the disk is expected to run continuously without any failure.
- Typically 3 to 5 years
- Probability of failure of new disks is quite low, corresponding to a “theoretical MTTF” of 500,000 to 1,200,000 hours (57 to 136 years)for a new disk
- An MTTF of 1,200,000 hours for a new disk means that given 1000 relatively new disks, on an average one will fail every 1200 hours(50 days)
- MTTF decreases as disk ages
Optimization of Disk-Block Access
- Buffering: in-memory buffer to cache disk blocks
- Read-ahead(Prefetch): Read extra blocks from a track in anticipation that they will be requested soon
- Disk-arm-scheduling algorithms re-order block requests so that disk arm movement is minimized
- 与controller相关
- elevator algorithm
-
File organization
- Allocate blocks of a file in as contiguous a manner as possible
- Allocation in units of extents(盘区)
- Files may get fragmented
- E.g., if free blocks on disk are scattered, and newly created file has its blocks scattered over the disk
- Sequential access to a fragmented file results in increased disk arm movement
- Some systems have utilities to defragment the file system, in order to speed up file access
-
Nonvolatile write buffers (非易失性写缓存) – speed up disk writes by writing blocks to a non-volatile RAM buffer immediately
- Non-volatile RAM: battery backed up RAM or flash memory
- Even if power fails, the data is safe and will be written to disk when power returns
- Controller then writes to disk whenever the disk has no other requests or request has been pending for some time
- Database operations that require data to be safely stored before continuing can continue without waiting for data to be written to disk
- Writes can be reordered to minimize disk arm movement
- Non-volatile RAM: battery backed up RAM or flash memory
-
Log disk(日志磁盘) – a disk devoted to writing a sequential log of block updates
- Used exactly like nonvolatile RAM
- Write to log disk is very fast since no seeks are required
- No need for special hardware (NV-RAM)
- Used exactly like nonvolatile RAM
Flash Storage
NAND flash - used widely for storage, cheaper than NOR flash, requires page-at-a-time read (page: 512 bytes to 4 KB)
SSD(Solid State Disks) :Use standard block-oriented disk interfaces, but store data on multiple flash storage devices internally
Solid State Disk可达上万IOPS
Erase happens in units of erase block
- Erase block typically 256 KB to 1 MB (128 to 256 pages)
Remapping of logical page addresses to physical page addresses avoids waiting for erase Flash translation table tracks mapping
- Remapping carried out by flash translation layer
wear leveling(磨损均衡)- evenly distributed erase operators across physical blocks
Storage Class Memory(NVM)
。。。。。。
Chapter 13: Data Storage Structures
数据库不是直接去管理上面说的硬盘,数据库引擎下面是OS,OS下面才是硬件
File Organization
The database is stored as a collection of files.
Each file is a sequence of records.
A record is a sequence of fields.
Fixed-Length Records
Store record i starting from byte n * (i – 1), where n is the size of each record.
Record access is simple but records may cross blocks
Modification: do not allow records to cross block boundaries
Deletion of record i
:
- move records
i + 1, . . ., n
toi, . . . , n – 1
- move record
n
toi
- do not move records, but link all free records on a free list
Variable-Length Records
Variable length attributes represented by fixed size (offset, length), with actual data stored after all fixed length attributes
Null values represented by null-value bitmap(空位图)
下图中,前面三个部分是length-fixed的,第一个数字定义position,第二个数字定义length
Slotted Page Structure
Slotted page(分槽页) header contains:
- number of record entries
- end of free space in the block
- location and size of each record
Records can be moved around within a page to keep them contiguous with no empty space between them; entry in the header must be updated.
Record pointers should not point directly to record — instead they should point to the entry for the record in header.
就如果在操作后records里面出现free space,就将移动record填充这个free space,相当于将free space都挤到前面去,集中在一起。
同时要更新record 的 entry
Organization of Records in Files
- Heap – record can be placed anywhere in the file where there is space
- Sequential – store records in sequential order, based on the value of the search key of each record
- In a multitable clustering file organization records of several different relations can be stored in the same file
- Motivation: store related records on the same block to minimize I/O
- B+-tree file organization ordered storage even with inserts/deletes
- Hashing – a hash function computed on search key; the result specifies in which block of the file the record should be placed
Heap File Organization
Records can be placed anywhere in the file where there is free space. Records usually do not move once allocated
Important to be able to efficiently find free space within file
Free-space map
Array with 1 entry per block.
Each entry is a few bits to a byte, and records fraction of block that is free
example
3 bits per block, value divided by 8 indicates
each entry stores maximum free space
Free space map written to disk periodically, OK to have wrong (old) values for some entries (will be detected and fixed)
Sequential File Organization
The records in the file are ordered by a search-key
Need to reorganize the file from time to time to restore sequential order
就定期重组一次
Multitable Clustering File Organization
Store several relations in one file
Can add pointer chains to link records of a particular relation
Table Partitioning
Records in a relation can be partitioned into smaller relations that are stored separately
E.g.,
transaction
relation may be partitioned intotrans_2018
,trans_2019
, etc
- Reduces costs of some operations such as free space management
- Allows different partitions to be stored on different storage devices
Data Dictionary Storage
The Data dictionary (also called system catalog) stores metadata
metadata:
- Information about relations
- User and accounting information, including passwords
- Statistical and descriptive data
- number of tuples in each relation
- Physical file organization information
- Information about indices
Relational Representation of System Metadata
Storage Access
一个块从硬盘读取后会暂时放在内存,确切说是放在缓冲区,以便短时间内的下次使用
程序会先通过buffer manager看内存里有没有,没有就先在buffer留个空间,然后从硬盘取出块放到buffer里
满了就腾出最远的未使用的块,LRU policy
We can reduce the number of disk accesses by keeping as many blocks as possible in main memory.
- Buffer – portion of main memory available to store copies of disk blocks.
- Buffer manager – subsystem responsible for allocating buffer space in main memory.
Buffer Manager in DBMS
LRU Example
LRU policy
Buffer Manager
Programs call on the buffer manager when they need a block from disk.
- If the block is already in the buffer, buffer manager returns the address of the block in main memory
- If the block is not in the buffer, the buffer manager
- Allocates space in the buffer for the block 1. Replacing (throwing out) some other block, if required, to make space for the new block. 2. Replaced block written back to disk only if it was modified since the most recent time that it was written to/fetched from the disk.
- Reads the block from the disk to the buffer, and returns the address of the block in main memory to requester