1. 12 Oct, 2018 1 commit
  2. 03 Oct, 2018 1 commit
  3. 06 Sep, 2018 1 commit
  4. 30 Jul, 2018 1 commit
    • dm thin: include metadata_low_watermark threshold in pool status · 63c8ecb6
      Andy Grover authored
      The metadata low watermark threshold is set by the kernel.  But the
      kernel depends on userspace to extend the thinpool metadata device when
      the threshold is crossed.
      Since the metadata low watermark threshold is not visible to userspace,
      upon receiving an event, userspace cannot tell that the kernel wants the
      metadata device extended, instead of some other eventing condition.
      Making it visible (but not settable) enables userspace to affirmatively
      know the kernel is asking for a metadata device extension, by comparing
      metadata_low_watermark against nr_free_blocks_metadata, also reported in
      Current solutions like dmeventd have their own thresholds for extending
      the data and metadata devices, and both devices are checked against
      their thresholds on each event.  This lessens the value of the kernel-set
      threshold, since userspace will either extend the metadata device sooner,
      when receiving another event; or will receive the metadata lowater event
      and do nothing, if dmeventd's threshold is less than the kernel's.
      (This second case is dangerous. The metadata lowater event will not be
      re-sent, so no further event will be generated before the metadata
      device is out of space, unless some other event causes userspace to
      recheck its thresholds.)
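      The comparison described above can be sketched in Python. This is a
      minimal illustration, assuming userspace has already extracted both
      counters from the pool status; the function name is hypothetical, and
      treating "at or below the watermark" as crossed is an assumption.

```python
def kernel_wants_metadata_extension(nr_free_blocks_metadata: int,
                                    metadata_low_watermark: int) -> bool:
    """Return True if free metadata blocks have reached the low watermark.

    Both arguments are block counts taken from the thin-pool status.
    Assumption: reaching the watermark exactly counts as crossing it.
    """
    if nr_free_blocks_metadata < 0 or metadata_low_watermark < 0:
        raise ValueError("block counts must be non-negative")
    return nr_free_blocks_metadata <= metadata_low_watermark
```

      A monitoring daemon could call this on every event and extend the
      metadata device whenever it returns True, independent of its own
      thresholds.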
      Signed-off-by: Andy Grover <agrover@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  5. 27 Jul, 2018 3 commits
  6. 02 Jul, 2018 1 commit
  7. 08 Jun, 2018 1 commit
    • dm: add writecache target · 48debafe
      Mikulas Patocka authored
      The writecache target caches writes on persistent memory or SSD.
      It is intended for databases or other programs that need extremely low
      commit latency.
      The writecache target doesn't cache reads because reads are supposed to
      be cached in page cache in normal RAM.
      If persistent memory isn't available this target can still be used in
      SSD mode.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Colin Ian King <colin.king@canonical.com> # fix missing goto
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> # fix compilation issue with !DAX
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> # use msecs_to_jiffies
      Acked-by: Dan Williams <dan.j.williams@intel.com> # reworks to unify ARM and x86 flushing
      Signed-off-by: Mike Snitzer <msnitzer@redhat.com>
  8. 10 May, 2018 1 commit
  9. 03 Apr, 2018 1 commit
    • dm verity: add 'check_at_most_once' option to only validate hashes once · 843f38d3
      Patrik Torstensson authored
      This allows platforms that are CPU/memory constrained to verify data
      blocks only the first time they are read from the data device, rather
      than every time.  As such, it provides a reduced level of security
      because only offline tampering of the data device's content will be
      detected, not online tampering.
      Hash blocks are still verified each time they are read from the hash
      device, since verification of hash blocks is less performance critical
      than data blocks, and a hash block will not be verified any more after
      all the data blocks it covers have been verified anyway.
      This option introduces a bitset that is used to check if a block has
      been validated before or not.  A block can be validated more than once
      as there is no thread protection for the bitset.
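      The bitset idea can be sketched in Python as follows. This is an
      illustration only, not the kernel code: the class, the function and the
      `verify` callback are hypothetical names. As in the description above,
      no locking is used, so two concurrent readers could both verify the same
      block before either sets its bit.

```python
class ValidatedBlocks:
    """Tracks which data blocks were already verified, one bit per block."""

    def __init__(self, nr_blocks: int) -> None:
        self.nr_blocks = nr_blocks
        self.bits = bytearray((nr_blocks + 7) // 8)

    def is_validated(self, block: int) -> bool:
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def mark_validated(self, block: int) -> None:
        self.bits[block // 8] |= 1 << (block % 8)


def read_block(block: int, validated: ValidatedBlocks, verify) -> bool:
    """Verify `block` only on its first successful read.

    `verify` is a hypothetical callback returning True when the block's
    hash matches. Returns True if the block may be handed to the reader.
    """
    if validated.is_validated(block):
        return True                  # checked earlier: skip hashing
    if not verify(block):
        return False                 # mismatch: leave the bit clear
    validated.mark_validated(block)
    return True
```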
      These changes were developed and tested on entry-level Android Go
      Signed-off-by: Patrik Torstensson <totte@google.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  10. 30 Jan, 2018 1 commit
  11. 17 Jan, 2018 9 commits
  12. 13 Dec, 2017 1 commit
    • dm raid: stop keeping raid set frozen altogether · 11e47232
      Heinz Mauelshagen authored
      In order to avoid redoing synchronization/recovery/reshape partially,
      the raid set got frozen until after all passed in table line flags had
      been cleared.  The related table reload sequence had to be precisely
      followed, or reshaping may lead to data corruption caused by the active
      mapping carrying on with a reshape when the inactive mapping already
      had retrieved a stale reshape position.
      Harden by retrieving the actual resync/recovery/reshape position
      during resume whilst the active table is suspended, which avoids the
      need to keep the raid set frozen at all.  This prevents superfluous
      redoing of an already resynchronized or recovered segment and,
      most importantly, the potential redoing of an already reshaped
      segment, which would cause data corruption.
      Fixes: d39f0010 ("dm raid: fix raid_resume() to keep raid set frozen as needed")
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  13. 08 Dec, 2017 1 commit
  14. 05 Oct, 2017 1 commit
    • dm raid: fix incorrect status output at the end of a "recover" process · 41dcf197
      Jonathan Brassow authored
      There are three important fields that indicate the overall health and
      status of an array: dev_health, sync_ratio, and sync_action.  They tell
      us the condition of the devices in the array, and the degree to which
      the array is synchronized.
      This commit fixes a condition that is reported incorrectly.  When a member
      of the array is being rebuilt or a new device is added, the "recover"
      process is used to synchronize it with the rest of the array.  When the
      process is complete, but the sync thread hasn't yet been reaped, it is
      possible for the state of MD to be:
       curr_resync_completed = <max dev size> (but not MaxSector)
       and all rdevs to be In_sync.
      This causes the 'array_in_sync' output parameter that is passed to
      rs_get_progress() to be computed incorrectly and reported as 'false' --
      or not in-sync.  This in turn causes the dev_health status characters to
      be reported as all 'a', rather than the proper 'A'.
      This can cause erroneous output for several seconds at a time when tools
      will want to be checking the condition due to events that are raised at
      the end of a sync process.  Fix this by properly calculating the
      'array_in_sync' return parameter in rs_get_progress().
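      For tools that consume the status line, the dev_health characters can be
      interpreted with a small Python helper like the one below. It is a sketch:
      the function names are hypothetical, and the character meanings ('A' alive
      and in-sync, 'a' alive but not in-sync, 'D' dead/failed, '-' not defined)
      are taken as the assumed convention for the dm-raid status output.

```python
HEALTH_MEANINGS = {
    "A": "in_sync",       # alive and in-sync
    "a": "not_in_sync",   # alive but not yet in-sync
    "D": "failed",        # dead / failed
    "-": "undefined",     # device tuple not defined
}


def summarize_dev_health(dev_health: str) -> dict:
    """Count how many devices are in each health state."""
    counts = {name: 0 for name in HEALTH_MEANINGS.values()}
    for ch in dev_health:
        if ch not in HEALTH_MEANINGS:
            raise ValueError(f"unexpected health character: {ch!r}")
        counts[HEALTH_MEANINGS[ch]] += 1
    return counts


def all_devices_in_sync(dev_health: str) -> bool:
    """True only if every device is reported as 'A'."""
    return len(dev_health) > 0 and all(ch == "A" for ch in dev_health)
```

      With the bug described above, a finished "recover" would briefly show all
      'a' characters, so all_devices_in_sync() would return False even though
      every device was already in sync.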
      Also, remove an unnecessary intermediate 'recovery_cp' variable in
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  15. 25 Jul, 2017 1 commit
  16. 19 Jun, 2017 1 commit
    • dm zoned: drive-managed zoned block device target · 3b1a94c8
      Damien Le Moal authored
      The dm-zoned device mapper target provides transparent write access
      to zoned block devices (ZBC and ZAC compliant block devices).
      dm-zoned hides from the device user (a file system or an application
      doing raw block device accesses) any constraint imposed on write
      requests by the device, equivalent to a drive-managed zoned block
      device model.
      Write requests are processed using a combination of on-disk buffering
      using the device conventional zones and direct in-place processing for
      requests aligned to a zone sequential write pointer position.
      A background reclaim process implemented using dm_kcopyd_copy ensures
      that conventional zones are always available for executing unaligned
      write requests. The reclaim process overhead is minimized by managing
      buffer zones in a least-recently-written order and first targeting the
      oldest buffer zones. Doing so, blocks under regular write access (such
      as metadata blocks of a file system) remain stored in conventional
      zones, resulting in no apparent overhead.
      The dm-zoned implementation focuses on simplicity and on minimizing overhead
      (CPU, memory and storage overhead). For a 14TB host-managed disk with
      256 MB zones, dm-zoned memory usage per disk instance is at most about
      3 MB and as little as 5 zones will be used internally for storing metadata
      and performing buffer zone reclaim operations. This is achieved using
      zone level indirection rather than a full block indirection system for
      managing block movement between zones.
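      A rough Python calculation shows why zone-level indirection keeps the
      mapping small for the disk described above. It is only back-of-the-envelope
      arithmetic: reading "14TB" as decimal terabytes, "256 MB" as 256 MiB, and
      using a 4096-byte block for the comparison are all assumptions.

```python
DISK_BYTES = 14 * 10**12      # "14TB", read as decimal terabytes (assumption)
ZONE_BYTES = 256 * 2**20      # "256 MB" zones, read as 256 MiB (assumption)
BLOCK_BYTES = 4096            # hypothetical block size, for comparison only


def mapping_entries(disk_bytes: int, unit_bytes: int) -> int:
    """Number of indirection entries needed at a given mapping granularity."""
    return disk_bytes // unit_bytes


zone_entries = mapping_entries(DISK_BYTES, ZONE_BYTES)
block_entries = mapping_entries(DISK_BYTES, BLOCK_BYTES)

print("entries with zone-level indirection: ", zone_entries)
print("entries with block-level indirection:", block_entries)
```

      Under these assumptions a zone-level table needs tens of thousands of
      entries, while a full block-level table would need billions.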
      The primary target of dm-zoned is host-managed zoned block devices, but it can
      also be used with host-aware device models to mitigate potential
      device-side performance degradation due to excessive random writing.
      Zoned block devices can be formatted and checked for use with the dm-zoned
      target using the dmzadm utility available at:
      https://github.com/hgst/dm-zoned-tools
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
      [Mike Snitzer partly refactored Damien's original work to clean up the code]
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  17. 24 Apr, 2017 2 commits
    • dm integrity: support larger block sizes · 9d609f85
      Mikulas Patocka authored
      The DM integrity block size can now be 512, 1k, 2k or 4k.  Using larger
      blocks reduces metadata handling overhead.  The block size can be
      configured at table load time using the "block_size:<value>" option,
      where <value> is expressed in bytes (the default is still 512 bytes).
      It is safe to use larger block sizes with DM integrity, because the
      DM integrity journal makes sure that the whole block is updated
      atomically even if the underlying device doesn't support atomic writes
      of that size (e.g. a 4k block on top of a 512b device).
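      A small Python helper can build the option string while rejecting
      unsupported sizes. It is a sketch with a hypothetical function name; only
      the four sizes and the "block_size:<value>" syntax come from the
      description above.

```python
ALLOWED_BLOCK_SIZES = (512, 1024, 2048, 4096)   # sizes listed in the commit message


def block_size_option(block_size: int = 512) -> str:
    """Return the optional table argument for the given block size in bytes."""
    if block_size not in ALLOWED_BLOCK_SIZES:
        raise ValueError(
            f"block size must be one of {ALLOWED_BLOCK_SIZES}, got {block_size}"
        )
    return f"block_size:{block_size}"
```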
      Depends-on: 2859323e ("block: fix blk_integrity_register to use template's interval_exp if not 0")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm integrity: various small changes and cleanups · 56b67a4f
      Mikulas Patocka authored
      Some coding style changes.
      Fix a bug where the test_tag array has insufficient size if the digest
      size of the internal hash is bigger than the tag size.
      The function __fls is undefined for a zero argument; this patch fixes
      undefined behavior if the user sets zero interleave_sectors.
      Fix the limit of optional arguments to 8.
      Don't allocate crypt_data on the stack to avoid a BUG with debug kernel.
      Rename all optional argument names to have underscores rather than
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  18. 27 Mar, 2017 2 commits
    • dm raid: add raid4/5/6 journal write-back support via journal_mode option · 6e53636f
      Heinz Mauelshagen authored
      Commit 63c32ed4 ("dm raid: add raid4/5/6 journaling support") added
      journal support to close the raid4/5/6 "write hole" -- in terms of
      writethrough caching.
      Introduce a "journal_mode" feature and use the new
      r5c_journal_mode_set() API to add support for switching the journal
      device's cache mode between write-through (the current default) and
      NOTE: If the journal device is not layered on resilient storage and it
      fails, write-through mode will cause the "write hole" to reoccur.  But
      if the journal fails while in write-back mode it will cause data loss
      for any dirty cache entries unless resilient storage is used for the
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm raid: fix table line argument order in status · 4464e36e
      Heinz Mauelshagen authored
      Commit 3a1c1ef2 ("dm raid: enhance status interface and fixup
      takeover/raid0") added new table line arguments and introduced an
      ordering flaw.  The sequence of the raid10_copies and raid10_format
      raid parameters got reversed which causes lvm2 userspace to fail by
      falsely assuming a changed table line.
      Sequence those 2 parameters as before so that old lvm2 can function
      properly with new kernels by adjusting the table line output as
      documented in Documentation/device-mapper/dm-raid.txt.
      Also, add the missing version 1.10.1 highlight to the documentation.
      Fixes: 3a1c1ef2 ("dm raid: enhance status interface and fixup takeover/raid0")
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  19. 24 Mar, 2017 5 commits
    • dm integrity: add recovery mode · c2bcb2b7
      Mikulas Patocka authored
      In recovery mode, we don't:
      - replay the journal
      - check checksums
      - allow writes to the device
      This mode can be used as a last resort for data recovery.  The
      motivation for recovery mode is that when there is a single error in the
      journal, the user should not lose access to the whole device.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm crypt: optionally support larger encryption sector size · 8f0009a2
      Milan Broz authored
      Add an optional "sector_size" parameter that specifies the encryption sector
      size (the atomic unit of block device encryption).
      The parameter can be in the range 512 - 4096 bytes and must be a power of two.
      For compatibility reasons, the maximal IO must fit into the page limit,
      so the limit is set to the minimal page size possible (4096 bytes).
      NOTE: this device cannot yet be handled by cryptsetup if this parameter
      is set.
      The IV for the sector is calculated from the 512-byte sector offset unless
      the iv_large_sectors option is used.
      Test script using dmsetup:
        DEV_SIZE=$(blockdev --getsz $DEV)
        # dmsetup create test_crypt --table "0 $DEV_SIZE crypt aes-xts-plain64 $KEY 0 $DEV 0 1 sector_size:$BLOCK_SIZE"
        # dmsetup table --showkeys test_crypt
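      The parameter rules above (range 512 to 4096 bytes, power of two) can be
      checked before building the table line. The following Python sketch mirrors
      the table used in the test script; the function names and the example
      device path are hypothetical.

```python
def validate_sector_size(sector_size: int) -> int:
    """Return sector_size if it is a power of two in the range 512..4096."""
    is_power_of_two = sector_size > 0 and (sector_size & (sector_size - 1)) == 0
    if not (512 <= sector_size <= 4096 and is_power_of_two):
        raise ValueError(f"invalid encryption sector size: {sector_size}")
    return sector_size


def crypt_table(dev: str, dev_size: int, key_hex: str, sector_size: int) -> str:
    """Build a dm-crypt table line with one optional sector_size argument.

    dev_size is the device size in 512-byte sectors, as printed by
    `blockdev --getsz`.
    """
    validate_sector_size(sector_size)
    return (f"0 {dev_size} crypt aes-xts-plain64 {key_hex} 0 {dev} 0 "
            f"1 sector_size:{sector_size}")


if __name__ == "__main__":
    # Example values only; /dev/sdX is a placeholder device path.
    print(crypt_table("/dev/sdX", 2048, "00" * 32, 4096))
```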
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm crypt: introduce new format of cipher with "capi:" prefix · 33d2f09f
      Milan Broz authored
      For the new authenticated encryption we have to support generic composed
      modes (combination of encryption algorithm and authenticator) because
      this is how the kernel crypto API accesses such algorithms.
      To simplify the interface, we accept an algorithm directly in crypto API
      format.  The new format is recognised by the "capi:" prefix.  The
      dmcrypt internal IV specification is the same as for the old format.
      The crypto API cipher specifications format is:
           capi:cbc(aes)-essiv:sha256 (equivalent to old aes-cbc-essiv:sha256)
           capi:xts(aes)-plain64      (equivalent to old aes-xts-plain64)
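      The two equivalences above follow a simple pattern, which the Python
      sketch below reproduces. The function name is hypothetical, and it only
      handles plain "cipher-mode-iv" specifications like the two shown; other
      forms of the old format are deliberately not covered.

```python
def old_to_capi(spec: str) -> str:
    """Convert e.g. 'aes-cbc-essiv:sha256' to 'capi:cbc(aes)-essiv:sha256'."""
    parts = spec.split("-", 2)           # cipher, chaining mode, IV spec
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected cipher-mode-iv, got {spec!r}")
    cipher, mode, iv = parts
    return f"capi:{mode}({cipher})-{iv}"
```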
      Examples of authenticated modes:
      Authenticated modes can only be configured using the new cipher format.
      Note that this format allows the user to specify arbitrary combinations that
      can be insecure. (The policy decision is made in cryptsetup userspace.)
      Authenticated encryption algorithms can be of two types: either native
      modes (like GCM) that perform both encryption and authentication
      internally, or composed modes where the user can compose AEAD with a separate
      specification of encryption algorithm and authenticator.
      For composed mode with HMAC (a length-preserving encryption mode like
      XTS and HMAC as an authenticator) we have to calculate the HMAC digest size
      (the separate authentication key is the same size as the HMAC digest).
      Introduce crypt_ctr_auth_cipher() to parse the crypto API string to get
      HMAC algorithm and retrieve digest size from it.
      Also, for HMAC composed mode we need to parse the crypto API string to
      get the cipher mode nested in the specification.  For native AEAD mode
      (like GCM), we can use crypto_tfm_alg_name() API to get the cipher
      Because the HMAC composed mode is not processed the same as the native
      AEAD mode, the CRYPT_MODE_INTEGRITY_HMAC flag is no longer needed and
      "hmac" specification for the table integrity argument is removed.
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm crypt: add cryptographic data integrity protection (authenticated encryption) · ef43aa38
      Milan Broz authored
      Allow the use of per-sector metadata, provided by the dm-integrity
      module, for integrity protection and persistently stored per-sector
      Initialization Vector (IV).  The underlying device must support the
      "DM-DIF-EXT-TAG" dm-integrity profile.
      The per-bio integrity metadata is allocated by dm-crypt for every bio.
      Example of low-level mapping table for various types of use:
       # Additional HMAC with CBC-ESSIV, key is concatenated encryption key + HMAC key
       dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 32 J 0"
       dmsetup create y --table "0 $SIZE_INT crypt aes-cbc-essiv:sha256 \
       11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \
       00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff \
       0 /dev/mapper/x 0 1 integrity:32:hmac(sha256)"
       # AEAD (Authenticated Encryption with Additional Data) - GCM with random IVs
       # GCM in kernel uses 96bits IV and we store 128bits auth tag (so 28 bytes metadata space)
       dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 28 J 0"
       dmsetup create y --table "0 $SIZE_INT crypt aes-gcm-random \
       11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \
       0 /dev/mapper/x 0 1 integrity:28:aead"
       # Random IV only for XTS mode (no integrity protection but provides atomic random sector change)
       dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 16 J 0"
       dmsetup create y --table "0 $SIZE_INT crypt aes-xts-random \
       11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \
       0 /dev/mapper/x 0 1 integrity:16:none"
       # Random IV with XTS + HMAC integrity protection
       dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 48 J 0"
       dmsetup create y --table "0 $SIZE_INT crypt aes-xts-random \
       11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \
       00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff \
       0 /dev/mapper/x 0 1 integrity:48:hmac(sha256)"
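      The per-sector metadata sizes in the tables above (32, 28, 16 and 48 bytes)
      are the stored IV size plus the authentication tag size. The Python sketch
      below redoes that arithmetic; the names are hypothetical, and the
      breakdown into IV and tag bytes for the non-GCM cases is an assumption
      based on a 16-byte random IV and the 32-byte SHA-256 digest.

```python
def integrity_tag_size(stored_iv_bytes: int, auth_tag_bytes: int) -> int:
    """Bytes of per-sector metadata that dm-integrity must provide."""
    return stored_iv_bytes + auth_tag_bytes


# (stored IV bytes, authentication tag bytes) for each example mapping
EXAMPLES = {
    "aes-cbc-essiv:sha256 + hmac(sha256)": (0, 32),   # IV derived, not stored
    "aes-gcm-random (aead)":               (12, 16),  # 96-bit IV + 128-bit tag
    "aes-xts-random (none)":               (16, 0),   # random IV only
    "aes-xts-random + hmac(sha256)":       (16, 32),  # random IV + HMAC
}

for name, (iv_bytes, tag_bytes) in EXAMPLES.items():
    print(f"{name}: {integrity_tag_size(iv_bytes, tag_bytes)} bytes")
```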
      Both AEAD and HMAC protection authenticate not only data but also
      sector metadata.
      HMAC protection is implemented through the authenc wrapper (so it is
      processed the same way as an authenticated mode).
      In HMAC mode there are two keys (concatenated in the dm-crypt mapping
      table).  The first is the encryption key and the second is the key for
      authentication (HMAC).  (It is a userspace decision whether these keys are
      independent or somehow derived.)
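      Splitting such a concatenated key can be sketched in Python as below. The
      function name is hypothetical, and the sketch assumes the authentication
      key has the same length as the HMAC digest (32 bytes for SHA-256), so the
      digest size determines where to cut.

```python
def split_hmac_key(key_hex: str, hmac_digest_bytes: int):
    """Split 'encryption key + HMAC key' into its two parts.

    Assumption: the trailing hmac_digest_bytes bytes are the HMAC key.
    Returns (encryption_key, authentication_key) as bytes.
    """
    key = bytes.fromhex(key_hex)
    if len(key) <= hmac_digest_bytes:
        raise ValueError("key too short to contain both keys")
    return key[:-hmac_digest_bytes], key[-hmac_digest_bytes:]


if __name__ == "__main__":
    # The concatenated key from the HMAC mapping examples above.
    table_key = (
        "11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b"
        "00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff"
    )
    enc_key, auth_key = split_hmac_key(table_key, 32)
    print(len(enc_key), len(auth_key))
```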
      The sector request for AEAD/HMAC authenticated encryption looks like this:
       |----- AAD -------|------ DATA -------|-- AUTH TAG --|
       | (authenticated) | (auth+encryption) |              |
       | sector_LE |  IV |  sector in/out    |  tag in/out  |
      For writes, the integrity fields are calculated during AEAD encryption
      of every sector and stored in bio integrity fields and sent to
      underlying dm-integrity target for storage.
      For reads, the integrity metadata is verified during AEAD decryption of
      every sector (they are filled in by dm-integrity, but the integrity
      fields are pre-allocated in dm-crypt).
      There is also experimental support in the cryptsetup utility for
      friendlier configuration (part of the LUKS2 format).
      Because the integrity fields are not valid on initial creation, the
      device must be "formatted".  This can be done by direct-io writes to the
      device (e.g. dd in direct-io mode).  For now, a trivial tool is available
      to do this, see: https://github.com/mbroz/dm_int_tools
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com>
      Signed-off-by: Vashek Matyas <matyas@fi.muni.cz>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: add integrity target · 7eada909
      Mikulas Patocka authored
      The dm-integrity target emulates a block device that has additional
      per-sector tags that can be used for storing integrity information.
      A general problem with storing integrity tags with every sector is that
      writing the sector and the integrity tag must be atomic - i.e. in case of a
      crash, either both the sector and the integrity tag are written or neither is.
      To guarantee write atomicity the dm-integrity target uses a journal. It
      writes sector data and integrity tags into a journal, commits the journal
      and then copies the data and integrity tags to their respective locations.
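      The journal sequence above (write to journal, commit, copy out) can be
      modelled with a toy in-memory store in Python. This is a conceptual sketch
      only; the class, its methods and the crash_at parameter are hypothetical
      and do not reflect the on-disk format or the kernel implementation.

```python
class ToyJournaledStore:
    """In-memory model of journaled sector + tag writes."""

    def __init__(self):
        self.sectors = {}    # sector number -> data at its final location
        self.tags = {}       # sector number -> integrity tag
        self.journal = []    # committed entries not yet copied out

    def write(self, sector, data, tag, crash_at=None):
        """Write data and tag together; crash_at simulates a crash point."""
        if crash_at == "before_commit":
            return                              # nothing committed, old state stays
        self.journal.append((sector, data, tag))  # journal write + commit
        if crash_at == "after_commit":
            return                              # committed, but not copied yet
        self.replay()

    def replay(self):
        """Copy committed entries to their final locations (also run on recovery)."""
        for sector, data, tag in self.journal:
            self.sectors[sector] = data
            self.tags[sector] = tag
        self.journal.clear()
```

      In both crash cases the sector and its tag stay consistent: either the old
      pair survives, or replay() installs the new pair as a unit.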
      The dm-integrity target can be used with the dm-crypt target - in this
      situation the dm-crypt target creates the integrity data and passes them
      to the dm-integrity target via bio_integrity_payload attached to the bio.
      In this mode, the dm-crypt and dm-integrity targets provide authenticated
      disk encryption - if the attacker modifies the encrypted device, an I/O
      error is returned instead of random data.
      The dm-integrity target can also be used as a standalone target, in this
      mode it calculates and verifies the integrity tag internally. In this
      mode, the dm-integrity target can be used to detect silent data
      corruption on the disk or in the I/O path.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  20. 19 Mar, 2017 1 commit
  21. 28 Feb, 2017 1 commit
  22. 16 Feb, 2017 1 commit
  23. 25 Jan, 2017 2 commits
    • dm raid: add raid4/5/6 journaling support · 63c32ed4
      Heinz Mauelshagen authored
      Add md raid4/5/6 journaling support (upstream commit bac624f3 started
      the implementation) which closes the write hole (i.e. non-atomic updates
      to stripes) using a dedicated journal device.
      raid4/5/6 stripes hold N data payloads per stripe plus one parity raid4/5
      or two raid6 P/Q syndrome payloads in an in-memory stripe cache.
      Parity or P/Q syndromes used to recover any data payloads in case of a disk
      failure are calculated from the N data payloads and need to be updated on the
      different component devices of the raid device.  Those are non-atomic,
      persistent updates.  Hence a crash can cause failure to update all stripe
      payloads persistently and thus cause data loss during stripe recovery.
      This problem gets addressed by writing whole stripe cache entries (together with
      journal metadata) to a persistent journal entry on a dedicated journal device.
      Only if that journal entry is written successfully, the stripe cache entry is
      updated on the component devices of the raid device (i.e. writethrough type).
      In case of a crash, the entry can be recovered from the journal and be written
      again, thus ensuring a consistent stripe payload suitable for data recovery.
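      The write hole itself can be demonstrated with XOR parity in a few lines
      of Python. This sketch models only the single-parity raid4/5 case with
      tiny byte strings as payloads; the function names are hypothetical and
      the raid6 P/Q syndromes are not modelled.

```python
from functools import reduce


def xor_parity(payloads):
    """XOR equally sized payloads byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*payloads))


def recover_missing(surviving_payloads, parity):
    """Rebuild the one missing data payload from the survivors and the parity."""
    return xor_parity(list(surviving_payloads) + [parity])


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]
    parity = xor_parity(data)

    # Consistent stripe: losing data[1] is recoverable.
    print(recover_missing([data[0], data[2]], parity) == data[1])

    # Write hole: data[0] is rewritten, then a crash hits before the parity
    # update. Recovering data[1] from the stale parity now yields garbage.
    data[0] = b"\xff\xff"
    print(recover_missing([data[0], data[2]], parity) == data[1])
```

      Journaling the whole stripe before touching the component devices means
      the data and parity updates can be replayed together after such a crash.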
      Future dependencies:
      once writeback caching, which is being worked on to compensate for the
      throughput implications of writethrough overhead, is supported with journaling
      in upstream, an additional patch based on this one will support it in dm-raid.
      Journal resilience related remarks:
      because stripes are recovered from the journal in case of a crash, the
      journal device had better be resilient.  Resilience becomes mandatory with
      future writeback support, because losing the working set in the log
      means data loss, as opposed to writethrough, where the loss of the
      journal device 'only' reintroduces the write hole.
      Fix comment on data offsets in parse_dev_params() and initialize
      new_data_offset as well.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm raid: fix transient device failure processing · c63ede3b
      Heinz Mauelshagen authored
      This fix addresses the following 3 failure scenarios:
      1) If a (transiently) inaccessible metadata device is being passed into the
      constructor (e.g. a device tuple '254:4 254:5'), it is processed as if
      '- -' was given.  This erroneously results in a status table line containing
      '- -', which mistakenly differs from what has been passed in.  As a result,
      userspace libdevmapper puts the device tuple separate from the RAID device
      thus not processing the dependencies properly.
      2) False health status char 'A' instead of 'D' is emitted on the
      status info line for the meta/data device tuple in this metadata device
      failure case.
      3) If the metadata device is accessible when passed into the constructor
      but the data device (partially) isn't, that leg may be set faulty by the
      raid personality on access to the (partially) unavailable leg.  Restore
      tried in a second raid device resume on such failed leg (status char 'D')
      fails after the (partial) leg returned.
      Fixes for aforementioned failure scenarios:
      - don't release passed in devices in the constructor thus allowing the
        status table line to e.g. contain '254:4 254:5' rather than '- -'
      - emit device status char 'D' rather than 'A' for the device tuple
        with the failed metadata device on the status info line
      - when attempting to restore faulty devices in a second resume, allow the
        device hot remove function to succeed by setting the device to not in-sync
      In case userspace intentionally passes '- -' into the constructor to avoid that
      device tuple (e.g. to split off a raid1 leg temporarily for later re-addition),
      the status table line will correctly show '- -' and the status info line will
      provide a '-' device health character for the non-defined device tuple.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>