From 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Mon Sep 17 00:00:00 2001
From: Linus Torvalds
Date: Sat, 16 Apr 2005 15:20:36 -0700
Subject: Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!
---
Documentation/00-INDEX | 294 + Documentation/BK-usage/00-INDEX | 51 + Documentation/BK-usage/bk-kernel-howto.txt | 283 + Documentation/BK-usage/bk-make-sum | 34 + Documentation/BK-usage/bksend | 36 + Documentation/BK-usage/bz64wrap | 41 + Documentation/BK-usage/cpcset | 36 + Documentation/BK-usage/cset-to-linus | 49 + Documentation/BK-usage/csets-to-patches | 44 + Documentation/BK-usage/gcapatch | 8 + Documentation/BK-usage/unbz64wrap | 25 + Documentation/BUG-HUNTING | 92 + Documentation/Changes | 410 ++ Documentation/CodingStyle | 431 ++ Documentation/DMA-API.txt | 526 ++ Documentation/DMA-mapping.txt | 881 +++ Documentation/DocBook/Makefile | 195 + Documentation/DocBook/deviceiobook.tmpl | 341 ++ Documentation/DocBook/gadget.tmpl | 752 +++ Documentation/DocBook/journal-api.tmpl | 333 ++ Documentation/DocBook/kernel-api.tmpl | 342 ++ Documentation/DocBook/kernel-hacking.tmpl | 1349 +++++ Documentation/DocBook/kernel-locking.tmpl | 2088 +++++++ Documentation/DocBook/libata.tmpl | 282 + Documentation/DocBook/librs.tmpl | 289 + Documentation/DocBook/lsm.tmpl | 265 + Documentation/DocBook/man/Makefile | 3 + Documentation/DocBook/mcabook.tmpl | 107 + Documentation/DocBook/mtdnand.tmpl | 1320 +++++ Documentation/DocBook/procfs-guide.tmpl | 591 ++ Documentation/DocBook/procfs_example.c | 224 + Documentation/DocBook/scsidrivers.tmpl | 193 + Documentation/DocBook/sis900.tmpl | 585 ++ Documentation/DocBook/tulip-user.tmpl | 327 ++ 
Documentation/DocBook/usb.tmpl | 979 ++++ Documentation/DocBook/via-audio.tmpl | 597 ++ Documentation/DocBook/videobook.tmpl | 1663 ++++++ Documentation/DocBook/wanbook.tmpl | 99 + Documentation/DocBook/writing_usb_driver.tmpl | 419 ++ Documentation/DocBook/z8530book.tmpl | 385 ++ Documentation/IO-mapping.txt | 208 + Documentation/IPMI.txt | 534 ++ Documentation/IRQ-affinity.txt | 37 + Documentation/MSI-HOWTO.txt | 503 ++ Documentation/ManagementStyle | 276 + Documentation/PCIEBUS-HOWTO.txt | 217 + Documentation/RCU/RTFP.txt | 387 ++ Documentation/RCU/UP.txt | 64 + Documentation/RCU/arrayRCU.txt | 141 + Documentation/RCU/checklist.txt | 157 + Documentation/RCU/listRCU.txt | 307 + Documentation/RCU/rcu.txt | 67 + Documentation/README.DAC960 | 756 +++ Documentation/README.cycladesZ | 8 + Documentation/SAK.txt | 88 + Documentation/SecurityBugs | 38 + Documentation/SubmittingDrivers | 145 + Documentation/SubmittingPatches | 374 ++ Documentation/VGA-softcursor.txt | 39 + Documentation/aoe/aoe.txt | 91 + Documentation/aoe/autoload.sh | 17 + Documentation/aoe/mkdevs.sh | 36 + Documentation/aoe/mkshelf.sh | 25 + Documentation/aoe/status.sh | 31 + Documentation/aoe/udev-install.sh | 26 + Documentation/aoe/udev.txt | 23 + Documentation/arm/00-INDEX | 20 + Documentation/arm/Booting | 141 + Documentation/arm/IXP2000 | 69 + Documentation/arm/IXP4xx | 174 + Documentation/arm/Interrupts | 173 + Documentation/arm/Netwinder | 78 + Documentation/arm/Porting | 135 + Documentation/arm/README | 198 + Documentation/arm/SA1100/ADSBitsy | 43 + Documentation/arm/SA1100/Assabet | 301 + Documentation/arm/SA1100/Brutus | 66 + Documentation/arm/SA1100/CERF | 29 + Documentation/arm/SA1100/FreeBird | 21 + Documentation/arm/SA1100/GraphicsClient | 98 + Documentation/arm/SA1100/GraphicsMaster | 53 + Documentation/arm/SA1100/HUW_WEBPANEL | 17 + Documentation/arm/SA1100/Itsy | 39 + Documentation/arm/SA1100/LART | 14 + Documentation/arm/SA1100/PLEB | 11 + Documentation/arm/SA1100/Pangolin | 23 + 
Documentation/arm/SA1100/Tifon | 7 + Documentation/arm/SA1100/Victor | 16 + Documentation/arm/SA1100/Yopy | 2 + Documentation/arm/SA1100/empeg | 2 + Documentation/arm/SA1100/nanoEngine | 11 + Documentation/arm/SA1100/serial_UART | 47 + Documentation/arm/Samsung-S3C24XX/EB2410ITX.txt | 58 + Documentation/arm/Samsung-S3C24XX/GPIO.txt | 122 + Documentation/arm/Samsung-S3C24XX/H1940.txt | 40 + Documentation/arm/Samsung-S3C24XX/Overview.txt | 156 + Documentation/arm/Samsung-S3C24XX/SMDK2440.txt | 56 + Documentation/arm/Samsung-S3C24XX/Suspend.txt | 106 + Documentation/arm/Setup | 129 + Documentation/arm/Sharp-LH/CompactFlash | 32 + Documentation/arm/Sharp-LH/IOBarrier | 45 + Documentation/arm/Sharp-LH/KEV7A400 | 8 + Documentation/arm/Sharp-LH/LPD7A400 | 15 + Documentation/arm/Sharp-LH/LPD7A40X | 16 + Documentation/arm/Sharp-LH/SDRAM | 51 + .../arm/Sharp-LH/VectoredInterruptController | 80 + Documentation/arm/VFP/release-notes.txt | 55 + Documentation/arm/empeg/README | 13 + Documentation/arm/empeg/ir.txt | 49 + Documentation/arm/empeg/mkdevs | 11 + Documentation/arm/mem_alignment | 58 + Documentation/arm/memory.txt | 72 + Documentation/arm/nwfpe/NOTES | 29 + Documentation/arm/nwfpe/README | 70 + Documentation/arm/nwfpe/README.FPE | 156 + Documentation/arm/nwfpe/TODO | 67 + Documentation/atomic_ops.txt | 456 ++ Documentation/basic_profiling.txt | 52 + Documentation/binfmt_misc.txt | 116 + Documentation/block/as-iosched.txt | 165 + Documentation/block/biodoc.txt | 1213 ++++ Documentation/block/deadline-iosched.txt | 78 + Documentation/block/request.txt | 88 + Documentation/cachetlb.txt | 384 ++ Documentation/cciss.txt | 132 + Documentation/cdrom/00-INDEX | 33 + Documentation/cdrom/Makefile | 21 + Documentation/cdrom/aztcd | 822 +++ Documentation/cdrom/cdrom-standard.tex | 1022 ++++ Documentation/cdrom/cdu31a | 196 + Documentation/cdrom/cm206 | 185 + Documentation/cdrom/gscd | 60 + Documentation/cdrom/ide-cd | 574 ++ Documentation/cdrom/isp16 | 100 + 
Documentation/cdrom/mcdx | 29 + Documentation/cdrom/optcd | 57 + Documentation/cdrom/packet-writing.txt | 97 + Documentation/cdrom/sbpcd | 1057 ++++ Documentation/cdrom/sjcd | 60 + Documentation/cdrom/sonycd535 | 121 + Documentation/cli-sti-removal.txt | 133 + Documentation/computone.txt | 588 ++ Documentation/cpqarray.txt | 93 + Documentation/cpu-freq/amd-powernow.txt | 38 + Documentation/cpu-freq/core.txt | 98 + Documentation/cpu-freq/cpu-drivers.txt | 216 + Documentation/cpu-freq/cpufreq-nforce2.txt | 19 + Documentation/cpu-freq/governors.txt | 155 + Documentation/cpu-freq/index.txt | 56 + Documentation/cpu-freq/user-guide.txt | 185 + Documentation/cpusets.txt | 415 ++ Documentation/cris/README | 195 + Documentation/crypto/api-intro.txt | 244 + Documentation/crypto/descore-readme.txt | 352 ++ Documentation/debugging-modules.txt | 18 + Documentation/device-mapper/dm-io.txt | 75 + Documentation/device-mapper/kcopyd.txt | 47 + Documentation/device-mapper/linear.txt | 61 + Documentation/device-mapper/striped.txt | 58 + Documentation/device-mapper/zero.txt | 37 + Documentation/devices.txt | 3216 +++++++++++ Documentation/digiepca.txt | 98 + Documentation/dnotify.txt | 99 + Documentation/driver-model/binding.txt | 102 + Documentation/driver-model/bus.txt | 160 + Documentation/driver-model/class.txt | 162 + Documentation/driver-model/device.txt | 154 + Documentation/driver-model/driver.txt | 287 + Documentation/driver-model/interface.txt | 129 + Documentation/driver-model/overview.txt | 114 + Documentation/driver-model/platform.txt | 99 + Documentation/driver-model/porting.txt | 445 ++ Documentation/dvb/README.dibusb | 285 + Documentation/dvb/avermedia.txt | 304 + Documentation/dvb/bt8xx.txt | 90 + Documentation/dvb/cards.txt | 85 + Documentation/dvb/contributors.txt | 79 + Documentation/dvb/faq.txt | 160 + Documentation/dvb/get_dvb_firmware | 397 ++ Documentation/dvb/readme.txt | 52 + Documentation/dvb/ttusb-dec.txt | 44 + Documentation/dvb/udev.txt | 46 + 
Documentation/early-userspace/README | 152 + Documentation/early-userspace/buffer-format.txt | 112 + Documentation/eisa.txt | 203 + Documentation/exception.txt | 292 + Documentation/fb/00-INDEX | 25 + Documentation/fb/aty128fb.txt | 72 + Documentation/fb/cirrusfb.txt | 97 + Documentation/fb/framebuffer.txt | 345 ++ Documentation/fb/intel810.txt | 272 + Documentation/fb/internals.txt | 82 + Documentation/fb/matroxfb.txt | 415 ++ Documentation/fb/modedb.txt | 61 + Documentation/fb/pvr2fb.txt | 61 + Documentation/fb/pxafb.txt | 54 + Documentation/fb/sa1100fb.txt | 39 + Documentation/fb/sisfb.txt | 158 + Documentation/fb/sstfb.txt | 174 + Documentation/fb/tgafb.txt | 69 + Documentation/fb/tridentfb.txt | 54 + Documentation/fb/vesafb.txt | 167 + Documentation/feature-removal-schedule.txt | 42 + Documentation/filesystems/00-INDEX | 50 + Documentation/filesystems/Exporting | 176 + Documentation/filesystems/Locking | 515 ++ Documentation/filesystems/adfs.txt | 57 + Documentation/filesystems/affs.txt | 219 + Documentation/filesystems/afs.txt | 155 + Documentation/filesystems/automount-support.txt | 118 + Documentation/filesystems/befs.txt | 117 + Documentation/filesystems/bfs.txt | 57 + Documentation/filesystems/cifs.txt | 51 + Documentation/filesystems/coda.txt | 1673 ++++++ Documentation/filesystems/cramfs.txt | 76 + Documentation/filesystems/devfs/ChangeLog | 1977 +++++++ Documentation/filesystems/devfs/README | 1964 +++++++ Documentation/filesystems/devfs/ToDo | 40 + Documentation/filesystems/devfs/boot-options | 65 + Documentation/filesystems/directory-locking | 113 + Documentation/filesystems/ext2.txt | 383 ++ Documentation/filesystems/ext3.txt | 183 + Documentation/filesystems/hfs.txt | 83 + Documentation/filesystems/hpfs.txt | 296 + Documentation/filesystems/isofs.txt | 38 + Documentation/filesystems/jfs.txt | 35 + Documentation/filesystems/ncpfs.txt | 12 + Documentation/filesystems/ntfs.txt | 630 ++ Documentation/filesystems/porting | 266 + 
Documentation/filesystems/proc.txt | 1940 +++++++ Documentation/filesystems/romfs.txt | 187 + Documentation/filesystems/smbfs.txt | 8 + Documentation/filesystems/sysfs-pci.txt | 88 + Documentation/filesystems/sysfs.txt | 341 ++ Documentation/filesystems/sysv-fs.txt | 38 + Documentation/filesystems/tmpfs.txt | 100 + Documentation/filesystems/udf.txt | 57 + Documentation/filesystems/ufs.txt | 61 + Documentation/filesystems/vfat.txt | 231 + Documentation/filesystems/vfs.txt | 671 +++ Documentation/filesystems/xfs.txt | 188 + Documentation/firmware_class/README | 124 + .../firmware_class/firmware_sample_driver.c | 126 + .../firmware_sample_firmware_class.c | 204 + Documentation/firmware_class/hotplug-script | 16 + Documentation/floppy.txt | 245 + Documentation/ftape.txt | 307 + Documentation/fujitsu/frv/README.txt | 51 + Documentation/fujitsu/frv/atomic-ops.txt | 134 + Documentation/fujitsu/frv/booting.txt | 181 + Documentation/fujitsu/frv/clock.txt | 65 + Documentation/fujitsu/frv/configuring.txt | 125 + Documentation/fujitsu/frv/features.txt | 310 + Documentation/fujitsu/frv/gdbinit | 102 + Documentation/fujitsu/frv/gdbstub.txt | 130 + Documentation/fujitsu/frv/mmu-layout.txt | 306 + Documentation/hayes-esp.txt | 154 + Documentation/highuid.txt | 79 + Documentation/hpet.txt | 298 + Documentation/hw_random.txt | 69 + Documentation/i2c/busses/i2c-ali1535 | 42 + Documentation/i2c/busses/i2c-ali1563 | 27 + Documentation/i2c/busses/i2c-ali15x3 | 112 + Documentation/i2c/busses/i2c-amd756 | 25 + Documentation/i2c/busses/i2c-amd8111 | 41 + Documentation/i2c/busses/i2c-i801 | 80 + Documentation/i2c/busses/i2c-i810 | 46 + Documentation/i2c/busses/i2c-nforce2 | 41 + Documentation/i2c/busses/i2c-parport | 154 + Documentation/i2c/busses/i2c-parport-light | 11 + Documentation/i2c/busses/i2c-pca-isa | 23 + Documentation/i2c/busses/i2c-piix4 | 72 + Documentation/i2c/busses/i2c-prosavage | 23 + Documentation/i2c/busses/i2c-savage4 | 26 + Documentation/i2c/busses/i2c-sis5595 | 59 + 
Documentation/i2c/busses/i2c-sis630 | 49 + Documentation/i2c/busses/i2c-sis69x | 73 + Documentation/i2c/busses/i2c-via | 34 + Documentation/i2c/busses/i2c-viapro | 47 + Documentation/i2c/busses/i2c-voodoo3 | 62 + Documentation/i2c/busses/scx200_acb | 14 + Documentation/i2c/chips/smsc47b397.txt | 146 + Documentation/i2c/dev-interface | 146 + Documentation/i2c/functionality | 135 + Documentation/i2c/i2c-protocol | 76 + Documentation/i2c/i2c-stub | 38 + Documentation/i2c/porting-clients | 133 + Documentation/i2c/smbus-protocol | 216 + Documentation/i2c/summary | 75 + Documentation/i2c/sysfs-interface | 274 + Documentation/i2c/ten-bit-addresses | 22 + Documentation/i2c/writing-clients | 816 +++ Documentation/i2o/README | 63 + Documentation/i2o/ioctl | 394 ++ Documentation/i386/IO-APIC.txt | 117 + Documentation/i386/boot.txt | 441 ++ Documentation/i386/usb-legacy-support.txt | 44 + Documentation/i386/zero-page.txt | 84 + Documentation/ia64/IRQ-redir.txt | 69 + Documentation/ia64/README | 43 + Documentation/ia64/efirtc.txt | 128 + Documentation/ia64/fsys.txt | 286 + Documentation/ia64/serial.txt | 144 + Documentation/ibm-acpi.txt | 474 ++ Documentation/ide.txt | 394 ++ Documentation/infiniband/ipoib.txt | 56 + Documentation/infiniband/sysfs.txt | 66 + Documentation/infiniband/user_mad.txt | 99 + Documentation/initrd.txt | 340 ++ Documentation/input/amijoy.txt | 184 + Documentation/input/atarikbd.txt | 709 +++ Documentation/input/cd32.txt | 19 + Documentation/input/cs461x.txt | 45 + Documentation/input/ff.txt | 227 + Documentation/input/gameport-programming.txt | 189 + Documentation/input/iforce-protocol.txt | 254 + Documentation/input/input-programming.txt | 281 + Documentation/input/input.txt | 312 + Documentation/input/interactive.fig | 42 + Documentation/input/joystick-api.txt | 316 + Documentation/input/joystick-parport.txt | 542 ++ Documentation/input/joystick.txt | 588 ++ Documentation/input/shape.fig | 65 + Documentation/input/xpad.txt | 116 + 
Documentation/io_ordering.txt | 47 + Documentation/ioctl-number.txt | 196 + Documentation/ioctl/cdrom.txt | 966 ++++ Documentation/ioctl/hdio.txt | 1070 ++++ Documentation/iostats.txt | 150 + Documentation/isapnp.txt | 14 + Documentation/isdn/00-INDEX | 43 + Documentation/isdn/CREDITS | 70 + Documentation/isdn/HiSax.cert | 96 + Documentation/isdn/INTERFACE | 759 +++ Documentation/isdn/INTERFACE.fax | 163 + Documentation/isdn/README | 599 ++ Documentation/isdn/README.FAQ | 26 + Documentation/isdn/README.HiSax | 659 +++ Documentation/isdn/README.act2000 | 104 + Documentation/isdn/README.audio | 138 + Documentation/isdn/README.avmb1 | 187 + Documentation/isdn/README.concap | 259 + Documentation/isdn/README.diversion | 127 + Documentation/isdn/README.fax | 45 + Documentation/isdn/README.hfc-pci | 41 + Documentation/isdn/README.hysdn | 195 + Documentation/isdn/README.icn | 148 + Documentation/isdn/README.pcbit | 40 + Documentation/isdn/README.sc | 281 + Documentation/isdn/README.syncppp | 58 + Documentation/isdn/README.x25 | 184 + Documentation/isdn/syncPPP.FAQ | 224 + Documentation/java.txt | 396 ++ Documentation/kbuild/00-INDEX | 8 + Documentation/kbuild/kconfig-language.txt | 282 + Documentation/kbuild/makefiles.txt | 1122 ++++ Documentation/kbuild/modules.txt | 419 ++ Documentation/kernel-doc-nano-HOWTO.txt | 150 + Documentation/kernel-docs.txt | 777 +++ Documentation/kernel-parameters.txt | 1511 +++++ Documentation/keys.txt | 869 +++ Documentation/kobject.txt | 367 ++ Documentation/laptop-mode.txt | 950 +++ Documentation/ldm.txt | 102 + Documentation/locks.txt | 84 + Documentation/logo.gif | Bin 0 -> 16335 bytes Documentation/logo.txt | 13 + Documentation/m68k/00-INDEX | 5 + Documentation/m68k/README.buddha | 210 + Documentation/m68k/kernel-options.txt | 964 ++++ Documentation/magic-number.txt | 174 + Documentation/mandatory.txt | 152 + Documentation/mca.txt | 320 ++ Documentation/md.txt | 118 + Documentation/memory.txt | 60 + Documentation/mips/GT64120.README | 65 
+ Documentation/mips/pci/pci.README | 54 + Documentation/mips/time.README | 198 + Documentation/mono.txt | 66 + Documentation/moxa-smartio | 411 ++ Documentation/mtrr.txt | 286 + Documentation/nbd.txt | 47 + Documentation/networking/00-INDEX | 127 + Documentation/networking/3c359.txt | 58 + Documentation/networking/3c505.txt | 46 + Documentation/networking/3c509.txt | 210 + Documentation/networking/6pack.txt | 175 + Documentation/networking/Configurable | 34 + Documentation/networking/DLINK.txt | 204 + Documentation/networking/NAPI_HOWTO.txt | 766 +++ Documentation/networking/PLIP.txt | 215 + Documentation/networking/README.sb1000 | 207 + Documentation/networking/TODO | 18 + Documentation/networking/alias.txt | 53 + Documentation/networking/arcnet-hardware.txt | 3133 ++++++++++ Documentation/networking/arcnet.txt | 555 ++ Documentation/networking/atm.txt | 8 + Documentation/networking/ax25.txt | 16 + Documentation/networking/baycom.txt | 158 + Documentation/networking/bonding.txt | 1618 ++++++ Documentation/networking/bridge.txt | 8 + Documentation/networking/comx.txt | 248 + Documentation/networking/cops.txt | 63 + Documentation/networking/cs89x0.txt | 703 +++ Documentation/networking/de4x5.txt | 178 + Documentation/networking/decnet.txt | 234 + Documentation/networking/depca.txt | 92 + Documentation/networking/dgrs.txt | 52 + Documentation/networking/dl2k.txt | 281 + Documentation/networking/dmfe.txt | 59 + Documentation/networking/driver.txt | 94 + Documentation/networking/e100.txt | 170 + Documentation/networking/e1000.txt | 401 ++ Documentation/networking/eql.txt | 528 ++ Documentation/networking/ewrk3.txt | 46 + Documentation/networking/filter.txt | 42 + Documentation/networking/fore200e.txt | 66 + Documentation/networking/framerelay.txt | 39 + Documentation/networking/gen_stats.txt | 117 + Documentation/networking/generic-hdlc.txt | 131 + Documentation/networking/ifenslave.c | 1110 ++++ Documentation/networking/ip-sysctl.txt | 878 +++ 
Documentation/networking/ip_dynaddr.txt | 29 + Documentation/networking/ipddp.txt | 78 + Documentation/networking/iphase.txt | 158 + Documentation/networking/irda.txt | 14 + Documentation/networking/ixgb.txt | 212 + Documentation/networking/lapb-module.txt | 263 + Documentation/networking/ltpc.txt | 131 + Documentation/networking/multicast.txt | 64 + Documentation/networking/ncsa-telnet | 16 + Documentation/networking/net-modules.txt | 324 ++ Documentation/networking/netconsole.txt | 57 + Documentation/networking/netdevices.txt | 75 + Documentation/networking/netif-msg.txt | 79 + Documentation/networking/olympic.txt | 79 + Documentation/networking/packet_mmap.txt | 399 ++ Documentation/networking/pktgen.txt | 214 + Documentation/networking/policy-routing.txt | 150 + Documentation/networking/ppp_generic.txt | 432 ++ Documentation/networking/proc_net_tcp.txt | 47 + Documentation/networking/pt.txt | 58 + Documentation/networking/ray_cs.txt | 151 + Documentation/networking/routing.txt | 46 + Documentation/networking/s2io.txt | 48 + Documentation/networking/sctp.txt | 38 + Documentation/networking/shaper.txt | 48 + Documentation/networking/sis900.txt | 257 + Documentation/networking/sk98lin.txt | 568 ++ Documentation/networking/skfp.txt | 220 + Documentation/networking/slicecom.hun | 371 ++ Documentation/networking/slicecom.txt | 369 ++ Documentation/networking/smc9.txt | 42 + Documentation/networking/smctr.txt | 66 + Documentation/networking/tcp.txt | 39 + Documentation/networking/tlan.txt | 117 + Documentation/networking/tms380tr.txt | 147 + Documentation/networking/tuntap.txt | 147 + Documentation/networking/vortex.txt | 450 ++ Documentation/networking/wan-router.txt | 622 ++ Documentation/networking/wanpipe.txt | 622 ++ Documentation/networking/wavelan.txt | 73 + Documentation/networking/x25-iface.txt | 123 + Documentation/networking/x25.txt | 44 + Documentation/networking/z8530drv.txt | 657 +++ Documentation/nfsroot.txt | 210 + Documentation/nmi_watchdog.txt | 81 + 
Documentation/nommu-mmap.txt | 198 + Documentation/numastat.txt | 22 + Documentation/oops-tracing.txt | 229 + Documentation/paride.txt | 417 ++ Documentation/parisc/00-INDEX | 6 + Documentation/parisc/debugging | 39 + Documentation/parisc/registers | 121 + Documentation/parport-lowlevel.txt | 1490 +++++ Documentation/parport.txt | 268 + Documentation/pci.txt | 284 + Documentation/pm.txt | 251 + Documentation/pnp.txt | 249 + Documentation/power/devices.txt | 319 ++ Documentation/power/interface.txt | 43 + Documentation/power/kernel_threads.txt | 41 + Documentation/power/pci.txt | 332 ++ Documentation/power/states.txt | 79 + Documentation/power/swsusp.txt | 235 + Documentation/power/tricks.txt | 27 + Documentation/power/video.txt | 169 + Documentation/power/video_extension.txt | 34 + Documentation/powerpc/00-INDEX | 20 + Documentation/powerpc/SBC8260_memory_mapping.txt | 197 + Documentation/powerpc/cpu_features.txt | 56 + Documentation/powerpc/eeh-pci-error-recovery.txt | 332 ++ Documentation/powerpc/hvcs.txt | 567 ++ Documentation/powerpc/mpc52xx.txt | 39 + Documentation/powerpc/ppc_htab.txt | 118 + Documentation/powerpc/smp.txt | 34 + Documentation/powerpc/sound.txt | 81 + Documentation/powerpc/zImage_layout.txt | 47 + Documentation/preempt-locking.txt | 135 + Documentation/prio_tree.txt | 107 + Documentation/ramdisk.txt | 167 + Documentation/riscom8.txt | 36 + Documentation/rocket.txt | 189 + Documentation/rpc-cache.txt | 171 + Documentation/rtc.txt | 282 + Documentation/s390/3270.ChangeLog | 44 + Documentation/s390/3270.txt | 274 + Documentation/s390/CommonIO | 109 + Documentation/s390/DASD | 73 + Documentation/s390/Debugging390.txt | 2536 ++++++++ Documentation/s390/TAPE | 122 + Documentation/s390/cds.txt | 513 ++ Documentation/s390/config3270.sh | 76 + Documentation/s390/crypto/crypto-API.txt | 83 + Documentation/s390/driver-model.txt | 265 + Documentation/s390/monreader.txt | 197 + Documentation/s390/s390dbf.txt | 615 ++ Documentation/sched-coding.txt | 126 + 
Documentation/sched-design.txt | 165 + Documentation/sched-domains.txt | 70 + Documentation/sched-stats.txt | 153 + Documentation/scsi/00-INDEX | 70 + Documentation/scsi/53c700.txt | 154 + Documentation/scsi/BusLogic.txt | 566 ++ Documentation/scsi/ChangeLog.1992-1997 | 2023 +++++++ Documentation/scsi/ChangeLog.ips | 122 + Documentation/scsi/ChangeLog.megaraid | 349 ++ Documentation/scsi/ChangeLog.ncr53c8xx | 495 ++ Documentation/scsi/ChangeLog.sym53c8xx | 593 ++ Documentation/scsi/ChangeLog.sym53c8xx_2 | 144 + Documentation/scsi/FlashPoint.txt | 163 + Documentation/scsi/LICENSE.FlashPoint | 60 + Documentation/scsi/Mylex.txt | 5 + Documentation/scsi/NinjaSCSI.txt | 130 + Documentation/scsi/aha152x.txt | 183 + Documentation/scsi/aic79xx.txt | 516 ++ Documentation/scsi/aic7xxx.txt | 414 ++ Documentation/scsi/aic7xxx_old.txt | 511 ++ Documentation/scsi/cpqfc.txt | 272 + Documentation/scsi/dc395x.txt | 102 + Documentation/scsi/dpti.txt | 83 + Documentation/scsi/dtc3x80.txt | 43 + Documentation/scsi/g_NCR5380.txt | 63 + Documentation/scsi/ibmmca.txt | 1402 +++++ Documentation/scsi/in2000.txt | 202 + Documentation/scsi/megaraid.txt | 70 + Documentation/scsi/ncr53c7xx.txt | 40 + Documentation/scsi/ncr53c8xx.txt | 1854 ++++++ Documentation/scsi/osst.txt | 219 + Documentation/scsi/ppa.txt | 16 + Documentation/scsi/qla2xxx.revision.notes | 457 ++ Documentation/scsi/qlogicfas.txt | 79 + Documentation/scsi/qlogicisp.txt | 30 + Documentation/scsi/scsi-generic.txt | 101 + Documentation/scsi/scsi.txt | 44 + Documentation/scsi/scsi_mid_low_api.txt | 1546 +++++ Documentation/scsi/st.txt | 499 ++ Documentation/scsi/sym53c500_cs.txt | 23 + Documentation/scsi/sym53c8xx_2.txt | 1059 ++++ Documentation/scsi/tmscsim.txt | 449 ++ Documentation/seclvl.txt | 97 + Documentation/serial-console.txt | 104 + Documentation/serial/driver | 330 ++ Documentation/sgi-visws.txt | 13 + Documentation/sh/kgdb.txt | 179 + Documentation/sh/new-machine.txt | 306 + Documentation/smart-config.txt | 102 + 
Documentation/smp.txt | 22 + Documentation/sonypi.txt | 142 + Documentation/sound/alsa/ALSA-Configuration.txt | 1505 +++++ Documentation/sound/alsa/Audigy-mixer.txt | 345 ++ Documentation/sound/alsa/Bt87x.txt | 78 + Documentation/sound/alsa/CMIPCI.txt | 242 + Documentation/sound/alsa/ControlNames.txt | 84 + .../sound/alsa/DocBook/alsa-driver-api.tmpl | 100 + .../sound/alsa/DocBook/writing-an-alsa-driver.tmpl | 6045 ++++++++++++++++++++ Documentation/sound/alsa/Joystick.txt | 86 + Documentation/sound/alsa/MIXART.txt | 100 + Documentation/sound/alsa/OSS-Emulation.txt | 297 + Documentation/sound/alsa/Procfile.txt | 191 + Documentation/sound/alsa/SB-Live-mixer.txt | 356 ++ Documentation/sound/alsa/VIA82xx-mixer.txt | 8 + Documentation/sound/alsa/hda_codec.txt | 299 + Documentation/sound/alsa/seq_oss.html | 409 ++ Documentation/sound/alsa/serial-u16550.txt | 88 + Documentation/sound/oss/AD1816 | 84 + Documentation/sound/oss/ALS | 66 + Documentation/sound/oss/AWE32 | 76 + Documentation/sound/oss/AudioExcelDSP16 | 101 + Documentation/sound/oss/CMI8330 | 153 + Documentation/sound/oss/CMI8338 | 85 + Documentation/sound/oss/CS4232 | 23 + Documentation/sound/oss/ESS | 34 + Documentation/sound/oss/ESS1868 | 55 + Documentation/sound/oss/INSTALL.awe | 134 + Documentation/sound/oss/Introduction | 459 ++ Documentation/sound/oss/MAD16 | 56 + Documentation/sound/oss/Maestro | 123 + Documentation/sound/oss/Maestro3 | 92 + Documentation/sound/oss/MultiSound | 1137 ++++ Documentation/sound/oss/NEWS | 42 + Documentation/sound/oss/NM256 | 280 + Documentation/sound/oss/OPL3 | 6 + Documentation/sound/oss/OPL3-SA | 52 + Documentation/sound/oss/OPL3-SA2 | 210 + Documentation/sound/oss/Opti | 222 + Documentation/sound/oss/PAS16 | 163 + Documentation/sound/oss/PSS | 41 + Documentation/sound/oss/PSS-updates | 88 + Documentation/sound/oss/README.OSS | 1456 +++++ Documentation/sound/oss/README.awe | 218 + Documentation/sound/oss/README.modules | 106 + Documentation/sound/oss/README.ymfsb | 107 + 
Documentation/sound/oss/SoundPro | 105 + Documentation/sound/oss/Soundblaster | 53 + Documentation/sound/oss/Tropez+ | 26 + Documentation/sound/oss/VIA-chipset | 43 + Documentation/sound/oss/VIBRA16 | 80 + Documentation/sound/oss/WaveArtist | 170 + Documentation/sound/oss/Wavefront | 339 ++ Documentation/sound/oss/btaudio | 92 + Documentation/sound/oss/cs46xx | 138 + Documentation/sound/oss/es1370 | 70 + Documentation/sound/oss/es1371 | 64 + Documentation/sound/oss/mwave | 185 + Documentation/sound/oss/rme96xx | 767 +++ Documentation/sound/oss/solo1 | 70 + Documentation/sound/oss/sonicvibes | 81 + Documentation/sound/oss/ultrasound | 30 + Documentation/sound/oss/vwsnd | 293 + Documentation/sparc/README-2.5 | 46 + Documentation/sparc/sbus_drivers.txt | 272 + Documentation/sparse.txt | 72 + Documentation/specialix.txt | 385 ++ Documentation/spinlocks.txt | 212 + Documentation/stable_api_nonsense.txt | 193 + Documentation/stallion.txt | 392 ++ Documentation/svga.txt | 276 + Documentation/sx.txt | 294 + Documentation/sysctl/README | 75 + Documentation/sysctl/abi.txt | 54 + Documentation/sysctl/fs.txt | 150 + Documentation/sysctl/kernel.txt | 314 + Documentation/sysctl/sunrpc.txt | 20 + Documentation/sysctl/vm.txt | 104 + Documentation/sysrq.txt | 213 + Documentation/telephony/ixj.txt | 406 ++ Documentation/time_interpolators.txt | 41 + Documentation/tipar.txt | 93 + Documentation/tty.txt | 198 + Documentation/uml/UserModeLinux-HOWTO.txt | 4686 +++++++++++++++ Documentation/unicode.txt | 175 + Documentation/usb/CREDITS | 175 + Documentation/usb/URB.txt | 252 + Documentation/usb/acm.txt | 138 + Documentation/usb/auerswald.txt | 30 + Documentation/usb/bluetooth.txt | 44 + Documentation/usb/dma.txt | 116 + Documentation/usb/ehci.txt | 212 + Documentation/usb/error-codes.txt | 167 + Documentation/usb/gadget_serial.txt | 332 ++ Documentation/usb/hiddev.txt | 205 + Documentation/usb/hotplug.txt | 148 + Documentation/usb/ibmcam.txt | 324 ++ Documentation/usb/linux.inf | 200 + 
Documentation/usb/mtouchusb.txt | 76 + Documentation/usb/ohci.txt | 32 + Documentation/usb/ov511.txt | 289 + Documentation/usb/proc_usb_info.txt | 371 ++ Documentation/usb/rio.txt | 138 + Documentation/usb/se401.txt | 54 + Documentation/usb/sn9c102.txt | 480 ++ Documentation/usb/stv680.txt | 55 + Documentation/usb/uhci.txt | 165 + Documentation/usb/usb-help.txt | 19 + Documentation/usb/usb-serial.txt | 470 ++ Documentation/usb/usbmon.txt | 156 + Documentation/usb/w9968cf.txt | 481 ++ Documentation/video4linux/API.html | 399 ++ Documentation/video4linux/CARDLIST.bttv | 121 + Documentation/video4linux/CARDLIST.saa7134 | 35 + Documentation/video4linux/CARDLIST.tuner | 46 + Documentation/video4linux/CQcam.txt | 412 ++ Documentation/video4linux/README.cpia | 191 + Documentation/video4linux/README.cx88 | 69 + Documentation/video4linux/README.ir | 72 + Documentation/video4linux/README.saa7134 | 73 + Documentation/video4linux/Zoran | 557 ++ Documentation/video4linux/bttv/CONTRIBUTORS | 25 + Documentation/video4linux/bttv/Cards | 964 ++++ Documentation/video4linux/bttv/ICs | 37 + Documentation/video4linux/bttv/Insmod-options | 173 + Documentation/video4linux/bttv/MAKEDEV | 28 + Documentation/video4linux/bttv/Modprobe.conf | 11 + Documentation/video4linux/bttv/Modules.conf | 14 + Documentation/video4linux/bttv/PROBLEMS | 62 + Documentation/video4linux/bttv/README | 90 + Documentation/video4linux/bttv/README.WINVIEW | 33 + Documentation/video4linux/bttv/README.freeze | 74 + Documentation/video4linux/bttv/README.quirks | 83 + Documentation/video4linux/bttv/Sound-FAQ | 148 + Documentation/video4linux/bttv/Specs | 3 + Documentation/video4linux/bttv/THANKS | 24 + Documentation/video4linux/bttv/Tuners | 115 + Documentation/video4linux/meye.txt | 130 + Documentation/video4linux/radiotrack.txt | 147 + Documentation/video4linux/w9966.txt | 33 + Documentation/video4linux/zr36120.txt | 159 + Documentation/vm/balance | 93 + Documentation/vm/hugetlbpage.txt | 284 + 
Documentation/vm/locking | 131 + Documentation/vm/numa | 41 + Documentation/vm/overcommit-accounting | 73 + Documentation/voyager.txt | 95 + Documentation/w1/w1.generic | 19 + Documentation/watchdog/pcwd-watchdog.txt | 135 + Documentation/watchdog/watchdog-api.txt | 399 ++ Documentation/watchdog/watchdog.txt | 115 + Documentation/x86_64/boot-options.txt | 180 + Documentation/x86_64/mm.txt | 24 + Documentation/xterm-linux.xpm | 61 + Documentation/zorro.txt | 102 + 722 files changed, 177485 insertions(+) create mode 100644 Documentation/00-INDEX create mode 100644 Documentation/BK-usage/00-INDEX create mode 100644 Documentation/BK-usage/bk-kernel-howto.txt create mode 100755 Documentation/BK-usage/bk-make-sum create mode 100755 Documentation/BK-usage/bksend create mode 100755 Documentation/BK-usage/bz64wrap create mode 100755 Documentation/BK-usage/cpcset create mode 100755 Documentation/BK-usage/cset-to-linus create mode 100755 Documentation/BK-usage/csets-to-patches create mode 100755 Documentation/BK-usage/gcapatch create mode 100755 Documentation/BK-usage/unbz64wrap create mode 100644 Documentation/BUG-HUNTING create mode 100644 Documentation/Changes create mode 100644 Documentation/CodingStyle create mode 100644 Documentation/DMA-API.txt create mode 100644 Documentation/DMA-mapping.txt create mode 100644 Documentation/DocBook/Makefile create mode 100644 Documentation/DocBook/deviceiobook.tmpl create mode 100644 Documentation/DocBook/gadget.tmpl create mode 100644 Documentation/DocBook/journal-api.tmpl create mode 100644 Documentation/DocBook/kernel-api.tmpl create mode 100644 Documentation/DocBook/kernel-hacking.tmpl create mode 100644 Documentation/DocBook/kernel-locking.tmpl create mode 100644 Documentation/DocBook/libata.tmpl create mode 100644 Documentation/DocBook/librs.tmpl create mode 100644 Documentation/DocBook/lsm.tmpl create mode 100644 Documentation/DocBook/man/Makefile create mode 100644 Documentation/DocBook/mcabook.tmpl create mode 100644 
Documentation/DocBook/mtdnand.tmpl create mode 100644 Documentation/DocBook/procfs-guide.tmpl create mode 100644 Documentation/DocBook/procfs_example.c create mode 100644 Documentation/DocBook/scsidrivers.tmpl create mode 100644 Documentation/DocBook/sis900.tmpl create mode 100644 Documentation/DocBook/tulip-user.tmpl create mode 100644 Documentation/DocBook/usb.tmpl create mode 100644 Documentation/DocBook/via-audio.tmpl create mode 100644 Documentation/DocBook/videobook.tmpl create mode 100644 Documentation/DocBook/wanbook.tmpl create mode 100644 Documentation/DocBook/writing_usb_driver.tmpl create mode 100644 Documentation/DocBook/z8530book.tmpl create mode 100644 Documentation/IO-mapping.txt create mode 100644 Documentation/IPMI.txt create mode 100644 Documentation/IRQ-affinity.txt create mode 100644 Documentation/MSI-HOWTO.txt create mode 100644 Documentation/ManagementStyle create mode 100644 Documentation/PCIEBUS-HOWTO.txt create mode 100644 Documentation/RCU/RTFP.txt create mode 100644 Documentation/RCU/UP.txt create mode 100644 Documentation/RCU/arrayRCU.txt create mode 100644 Documentation/RCU/checklist.txt create mode 100644 Documentation/RCU/listRCU.txt create mode 100644 Documentation/RCU/rcu.txt create mode 100644 Documentation/README.DAC960 create mode 100644 Documentation/README.cycladesZ create mode 100644 Documentation/SAK.txt create mode 100644 Documentation/SecurityBugs create mode 100644 Documentation/SubmittingDrivers create mode 100644 Documentation/SubmittingPatches create mode 100644 Documentation/VGA-softcursor.txt create mode 100644 Documentation/aoe/aoe.txt create mode 100644 Documentation/aoe/autoload.sh create mode 100644 Documentation/aoe/mkdevs.sh create mode 100644 Documentation/aoe/mkshelf.sh create mode 100644 Documentation/aoe/status.sh create mode 100644 Documentation/aoe/udev-install.sh create mode 100644 Documentation/aoe/udev.txt create mode 100644 Documentation/arm/00-INDEX create mode 100644 Documentation/arm/Booting create 
mode 100644 Documentation/arm/IXP2000 create mode 100644 Documentation/arm/IXP4xx create mode 100644 Documentation/arm/Interrupts create mode 100644 Documentation/arm/Netwinder create mode 100644 Documentation/arm/Porting create mode 100644 Documentation/arm/README create mode 100644 Documentation/arm/SA1100/ADSBitsy create mode 100644 Documentation/arm/SA1100/Assabet create mode 100644 Documentation/arm/SA1100/Brutus create mode 100644 Documentation/arm/SA1100/CERF create mode 100644 Documentation/arm/SA1100/FreeBird create mode 100644 Documentation/arm/SA1100/GraphicsClient create mode 100644 Documentation/arm/SA1100/GraphicsMaster create mode 100644 Documentation/arm/SA1100/HUW_WEBPANEL create mode 100644 Documentation/arm/SA1100/Itsy create mode 100644 Documentation/arm/SA1100/LART create mode 100644 Documentation/arm/SA1100/PLEB create mode 100644 Documentation/arm/SA1100/Pangolin create mode 100644 Documentation/arm/SA1100/Tifon create mode 100644 Documentation/arm/SA1100/Victor create mode 100644 Documentation/arm/SA1100/Yopy create mode 100644 Documentation/arm/SA1100/empeg create mode 100644 Documentation/arm/SA1100/nanoEngine create mode 100644 Documentation/arm/SA1100/serial_UART create mode 100644 Documentation/arm/Samsung-S3C24XX/EB2410ITX.txt create mode 100644 Documentation/arm/Samsung-S3C24XX/GPIO.txt create mode 100644 Documentation/arm/Samsung-S3C24XX/H1940.txt create mode 100644 Documentation/arm/Samsung-S3C24XX/Overview.txt create mode 100644 Documentation/arm/Samsung-S3C24XX/SMDK2440.txt create mode 100644 Documentation/arm/Samsung-S3C24XX/Suspend.txt create mode 100644 Documentation/arm/Setup create mode 100644 Documentation/arm/Sharp-LH/CompactFlash create mode 100644 Documentation/arm/Sharp-LH/IOBarrier create mode 100644 Documentation/arm/Sharp-LH/KEV7A400 create mode 100644 Documentation/arm/Sharp-LH/LPD7A400 create mode 100644 Documentation/arm/Sharp-LH/LPD7A40X create mode 100644 Documentation/arm/Sharp-LH/SDRAM create mode 100644 
Documentation/arm/Sharp-LH/VectoredInterruptController create mode 100644 Documentation/arm/VFP/release-notes.txt create mode 100644 Documentation/arm/empeg/README create mode 100644 Documentation/arm/empeg/ir.txt create mode 100644 Documentation/arm/empeg/mkdevs create mode 100644 Documentation/arm/mem_alignment create mode 100644 Documentation/arm/memory.txt create mode 100644 Documentation/arm/nwfpe/NOTES create mode 100644 Documentation/arm/nwfpe/README create mode 100644 Documentation/arm/nwfpe/README.FPE create mode 100644 Documentation/arm/nwfpe/TODO create mode 100644 Documentation/atomic_ops.txt create mode 100644 Documentation/basic_profiling.txt create mode 100644 Documentation/binfmt_misc.txt create mode 100644 Documentation/block/as-iosched.txt create mode 100644 Documentation/block/biodoc.txt create mode 100644 Documentation/block/deadline-iosched.txt create mode 100644 Documentation/block/request.txt create mode 100644 Documentation/cachetlb.txt create mode 100644 Documentation/cciss.txt create mode 100644 Documentation/cdrom/00-INDEX create mode 100644 Documentation/cdrom/Makefile create mode 100644 Documentation/cdrom/aztcd create mode 100644 Documentation/cdrom/cdrom-standard.tex create mode 100644 Documentation/cdrom/cdu31a create mode 100644 Documentation/cdrom/cm206 create mode 100644 Documentation/cdrom/gscd create mode 100644 Documentation/cdrom/ide-cd create mode 100644 Documentation/cdrom/isp16 create mode 100644 Documentation/cdrom/mcdx create mode 100644 Documentation/cdrom/optcd create mode 100644 Documentation/cdrom/packet-writing.txt create mode 100644 Documentation/cdrom/sbpcd create mode 100644 Documentation/cdrom/sjcd create mode 100644 Documentation/cdrom/sonycd535 create mode 100644 Documentation/cli-sti-removal.txt create mode 100644 Documentation/computone.txt create mode 100644 Documentation/cpqarray.txt create mode 100644 Documentation/cpu-freq/amd-powernow.txt create mode 100644 Documentation/cpu-freq/core.txt create mode 
100644 Documentation/cpu-freq/cpu-drivers.txt create mode 100644 Documentation/cpu-freq/cpufreq-nforce2.txt create mode 100644 Documentation/cpu-freq/governors.txt create mode 100644 Documentation/cpu-freq/index.txt create mode 100644 Documentation/cpu-freq/user-guide.txt create mode 100644 Documentation/cpusets.txt create mode 100644 Documentation/cris/README create mode 100644 Documentation/crypto/api-intro.txt create mode 100644 Documentation/crypto/descore-readme.txt create mode 100644 Documentation/debugging-modules.txt create mode 100644 Documentation/device-mapper/dm-io.txt create mode 100644 Documentation/device-mapper/kcopyd.txt create mode 100644 Documentation/device-mapper/linear.txt create mode 100644 Documentation/device-mapper/striped.txt create mode 100644 Documentation/device-mapper/zero.txt create mode 100644 Documentation/devices.txt create mode 100644 Documentation/digiepca.txt create mode 100644 Documentation/dnotify.txt create mode 100644 Documentation/driver-model/binding.txt create mode 100644 Documentation/driver-model/bus.txt create mode 100644 Documentation/driver-model/class.txt create mode 100644 Documentation/driver-model/device.txt create mode 100644 Documentation/driver-model/driver.txt create mode 100644 Documentation/driver-model/interface.txt create mode 100644 Documentation/driver-model/overview.txt create mode 100644 Documentation/driver-model/platform.txt create mode 100644 Documentation/driver-model/porting.txt create mode 100644 Documentation/dvb/README.dibusb create mode 100644 Documentation/dvb/avermedia.txt create mode 100644 Documentation/dvb/bt8xx.txt create mode 100644 Documentation/dvb/cards.txt create mode 100644 Documentation/dvb/contributors.txt create mode 100644 Documentation/dvb/faq.txt create mode 100644 Documentation/dvb/get_dvb_firmware create mode 100644 Documentation/dvb/readme.txt create mode 100644 Documentation/dvb/ttusb-dec.txt create mode 100644 Documentation/dvb/udev.txt create mode 100644 
Documentation/early-userspace/README create mode 100644 Documentation/early-userspace/buffer-format.txt create mode 100644 Documentation/eisa.txt create mode 100644 Documentation/exception.txt create mode 100644 Documentation/fb/00-INDEX create mode 100644 Documentation/fb/aty128fb.txt create mode 100644 Documentation/fb/cirrusfb.txt create mode 100644 Documentation/fb/framebuffer.txt create mode 100644 Documentation/fb/intel810.txt create mode 100644 Documentation/fb/internals.txt create mode 100644 Documentation/fb/matroxfb.txt create mode 100644 Documentation/fb/modedb.txt create mode 100644 Documentation/fb/pvr2fb.txt create mode 100644 Documentation/fb/pxafb.txt create mode 100644 Documentation/fb/sa1100fb.txt create mode 100644 Documentation/fb/sisfb.txt create mode 100644 Documentation/fb/sstfb.txt create mode 100644 Documentation/fb/tgafb.txt create mode 100644 Documentation/fb/tridentfb.txt create mode 100644 Documentation/fb/vesafb.txt create mode 100644 Documentation/feature-removal-schedule.txt create mode 100644 Documentation/filesystems/00-INDEX create mode 100644 Documentation/filesystems/Exporting create mode 100644 Documentation/filesystems/Locking create mode 100644 Documentation/filesystems/adfs.txt create mode 100644 Documentation/filesystems/affs.txt create mode 100644 Documentation/filesystems/afs.txt create mode 100644 Documentation/filesystems/automount-support.txt create mode 100644 Documentation/filesystems/befs.txt create mode 100644 Documentation/filesystems/bfs.txt create mode 100644 Documentation/filesystems/cifs.txt create mode 100644 Documentation/filesystems/coda.txt create mode 100644 Documentation/filesystems/cramfs.txt create mode 100644 Documentation/filesystems/devfs/ChangeLog create mode 100644 Documentation/filesystems/devfs/README create mode 100644 Documentation/filesystems/devfs/ToDo create mode 100644 Documentation/filesystems/devfs/boot-options create mode 100644 Documentation/filesystems/directory-locking create mode 
100644 Documentation/filesystems/ext2.txt create mode 100644 Documentation/filesystems/ext3.txt create mode 100644 Documentation/filesystems/hfs.txt create mode 100644 Documentation/filesystems/hpfs.txt create mode 100644 Documentation/filesystems/isofs.txt create mode 100644 Documentation/filesystems/jfs.txt create mode 100644 Documentation/filesystems/ncpfs.txt create mode 100644 Documentation/filesystems/ntfs.txt create mode 100644 Documentation/filesystems/porting create mode 100644 Documentation/filesystems/proc.txt create mode 100644 Documentation/filesystems/romfs.txt create mode 100644 Documentation/filesystems/smbfs.txt create mode 100644 Documentation/filesystems/sysfs-pci.txt create mode 100644 Documentation/filesystems/sysfs.txt create mode 100644 Documentation/filesystems/sysv-fs.txt create mode 100644 Documentation/filesystems/tmpfs.txt create mode 100644 Documentation/filesystems/udf.txt create mode 100644 Documentation/filesystems/ufs.txt create mode 100644 Documentation/filesystems/vfat.txt create mode 100644 Documentation/filesystems/vfs.txt create mode 100644 Documentation/filesystems/xfs.txt create mode 100644 Documentation/firmware_class/README create mode 100644 Documentation/firmware_class/firmware_sample_driver.c create mode 100644 Documentation/firmware_class/firmware_sample_firmware_class.c create mode 100644 Documentation/firmware_class/hotplug-script create mode 100644 Documentation/floppy.txt create mode 100644 Documentation/ftape.txt create mode 100644 Documentation/fujitsu/frv/README.txt create mode 100644 Documentation/fujitsu/frv/atomic-ops.txt create mode 100644 Documentation/fujitsu/frv/booting.txt create mode 100644 Documentation/fujitsu/frv/clock.txt create mode 100644 Documentation/fujitsu/frv/configuring.txt create mode 100644 Documentation/fujitsu/frv/features.txt create mode 100644 Documentation/fujitsu/frv/gdbinit create mode 100644 Documentation/fujitsu/frv/gdbstub.txt create mode 100644 
Documentation/fujitsu/frv/mmu-layout.txt create mode 100644 Documentation/hayes-esp.txt create mode 100644 Documentation/highuid.txt create mode 100644 Documentation/hpet.txt create mode 100644 Documentation/hw_random.txt create mode 100644 Documentation/i2c/busses/i2c-ali1535 create mode 100644 Documentation/i2c/busses/i2c-ali1563 create mode 100644 Documentation/i2c/busses/i2c-ali15x3 create mode 100644 Documentation/i2c/busses/i2c-amd756 create mode 100644 Documentation/i2c/busses/i2c-amd8111 create mode 100644 Documentation/i2c/busses/i2c-i801 create mode 100644 Documentation/i2c/busses/i2c-i810 create mode 100644 Documentation/i2c/busses/i2c-nforce2 create mode 100644 Documentation/i2c/busses/i2c-parport create mode 100644 Documentation/i2c/busses/i2c-parport-light create mode 100644 Documentation/i2c/busses/i2c-pca-isa create mode 100644 Documentation/i2c/busses/i2c-piix4 create mode 100644 Documentation/i2c/busses/i2c-prosavage create mode 100644 Documentation/i2c/busses/i2c-savage4 create mode 100644 Documentation/i2c/busses/i2c-sis5595 create mode 100644 Documentation/i2c/busses/i2c-sis630 create mode 100644 Documentation/i2c/busses/i2c-sis69x create mode 100644 Documentation/i2c/busses/i2c-via create mode 100644 Documentation/i2c/busses/i2c-viapro create mode 100644 Documentation/i2c/busses/i2c-voodoo3 create mode 100644 Documentation/i2c/busses/scx200_acb create mode 100644 Documentation/i2c/chips/smsc47b397.txt create mode 100644 Documentation/i2c/dev-interface create mode 100644 Documentation/i2c/functionality create mode 100644 Documentation/i2c/i2c-protocol create mode 100644 Documentation/i2c/i2c-stub create mode 100644 Documentation/i2c/porting-clients create mode 100644 Documentation/i2c/smbus-protocol create mode 100644 Documentation/i2c/summary create mode 100644 Documentation/i2c/sysfs-interface create mode 100644 Documentation/i2c/ten-bit-addresses create mode 100644 Documentation/i2c/writing-clients create mode 100644 Documentation/i2o/README 
create mode 100644 Documentation/i2o/ioctl create mode 100644 Documentation/i386/IO-APIC.txt create mode 100644 Documentation/i386/boot.txt create mode 100644 Documentation/i386/usb-legacy-support.txt create mode 100644 Documentation/i386/zero-page.txt create mode 100644 Documentation/ia64/IRQ-redir.txt create mode 100644 Documentation/ia64/README create mode 100644 Documentation/ia64/efirtc.txt create mode 100644 Documentation/ia64/fsys.txt create mode 100644 Documentation/ia64/serial.txt create mode 100644 Documentation/ibm-acpi.txt create mode 100644 Documentation/ide.txt create mode 100644 Documentation/infiniband/ipoib.txt create mode 100644 Documentation/infiniband/sysfs.txt create mode 100644 Documentation/infiniband/user_mad.txt create mode 100644 Documentation/initrd.txt create mode 100644 Documentation/input/amijoy.txt create mode 100644 Documentation/input/atarikbd.txt create mode 100644 Documentation/input/cd32.txt create mode 100644 Documentation/input/cs461x.txt create mode 100644 Documentation/input/ff.txt create mode 100644 Documentation/input/gameport-programming.txt create mode 100644 Documentation/input/iforce-protocol.txt create mode 100644 Documentation/input/input-programming.txt create mode 100644 Documentation/input/input.txt create mode 100644 Documentation/input/interactive.fig create mode 100644 Documentation/input/joystick-api.txt create mode 100644 Documentation/input/joystick-parport.txt create mode 100644 Documentation/input/joystick.txt create mode 100644 Documentation/input/shape.fig create mode 100644 Documentation/input/xpad.txt create mode 100644 Documentation/io_ordering.txt create mode 100644 Documentation/ioctl-number.txt create mode 100644 Documentation/ioctl/cdrom.txt create mode 100644 Documentation/ioctl/hdio.txt create mode 100644 Documentation/iostats.txt create mode 100644 Documentation/isapnp.txt create mode 100644 Documentation/isdn/00-INDEX create mode 100644 Documentation/isdn/CREDITS create mode 100644 
Documentation/isdn/HiSax.cert create mode 100644 Documentation/isdn/INTERFACE create mode 100644 Documentation/isdn/INTERFACE.fax create mode 100644 Documentation/isdn/README create mode 100644 Documentation/isdn/README.FAQ create mode 100644 Documentation/isdn/README.HiSax create mode 100644 Documentation/isdn/README.act2000 create mode 100644 Documentation/isdn/README.audio create mode 100644 Documentation/isdn/README.avmb1 create mode 100644 Documentation/isdn/README.concap create mode 100644 Documentation/isdn/README.diversion create mode 100644 Documentation/isdn/README.fax create mode 100644 Documentation/isdn/README.hfc-pci create mode 100644 Documentation/isdn/README.hysdn create mode 100644 Documentation/isdn/README.icn create mode 100644 Documentation/isdn/README.pcbit create mode 100644 Documentation/isdn/README.sc create mode 100644 Documentation/isdn/README.syncppp create mode 100644 Documentation/isdn/README.x25 create mode 100644 Documentation/isdn/syncPPP.FAQ create mode 100644 Documentation/java.txt create mode 100644 Documentation/kbuild/00-INDEX create mode 100644 Documentation/kbuild/kconfig-language.txt create mode 100644 Documentation/kbuild/makefiles.txt create mode 100644 Documentation/kbuild/modules.txt create mode 100644 Documentation/kernel-doc-nano-HOWTO.txt create mode 100644 Documentation/kernel-docs.txt create mode 100644 Documentation/kernel-parameters.txt create mode 100644 Documentation/keys.txt create mode 100644 Documentation/kobject.txt create mode 100644 Documentation/laptop-mode.txt create mode 100644 Documentation/ldm.txt create mode 100644 Documentation/locks.txt create mode 100644 Documentation/logo.gif create mode 100644 Documentation/logo.txt create mode 100644 Documentation/m68k/00-INDEX create mode 100644 Documentation/m68k/README.buddha create mode 100644 Documentation/m68k/kernel-options.txt create mode 100644 Documentation/magic-number.txt create mode 100644 Documentation/mandatory.txt create mode 100644 
Documentation/mca.txt create mode 100644 Documentation/md.txt create mode 100644 Documentation/memory.txt create mode 100644 Documentation/mips/GT64120.README create mode 100644 Documentation/mips/pci/pci.README create mode 100644 Documentation/mips/time.README create mode 100644 Documentation/mono.txt create mode 100644 Documentation/moxa-smartio create mode 100644 Documentation/mtrr.txt create mode 100644 Documentation/nbd.txt create mode 100644 Documentation/networking/00-INDEX create mode 100644 Documentation/networking/3c359.txt create mode 100644 Documentation/networking/3c505.txt create mode 100644 Documentation/networking/3c509.txt create mode 100644 Documentation/networking/6pack.txt create mode 100644 Documentation/networking/Configurable create mode 100644 Documentation/networking/DLINK.txt create mode 100644 Documentation/networking/NAPI_HOWTO.txt create mode 100644 Documentation/networking/PLIP.txt create mode 100644 Documentation/networking/README.sb1000 create mode 100644 Documentation/networking/TODO create mode 100644 Documentation/networking/alias.txt create mode 100644 Documentation/networking/arcnet-hardware.txt create mode 100644 Documentation/networking/arcnet.txt create mode 100644 Documentation/networking/atm.txt create mode 100644 Documentation/networking/ax25.txt create mode 100644 Documentation/networking/baycom.txt create mode 100644 Documentation/networking/bonding.txt create mode 100644 Documentation/networking/bridge.txt create mode 100644 Documentation/networking/comx.txt create mode 100644 Documentation/networking/cops.txt create mode 100644 Documentation/networking/cs89x0.txt create mode 100644 Documentation/networking/de4x5.txt create mode 100644 Documentation/networking/decnet.txt create mode 100644 Documentation/networking/depca.txt create mode 100644 Documentation/networking/dgrs.txt create mode 100644 Documentation/networking/dl2k.txt create mode 100644 Documentation/networking/dmfe.txt create mode 100644 
Documentation/networking/driver.txt create mode 100644 Documentation/networking/e100.txt create mode 100644 Documentation/networking/e1000.txt create mode 100644 Documentation/networking/eql.txt create mode 100644 Documentation/networking/ewrk3.txt create mode 100644 Documentation/networking/filter.txt create mode 100644 Documentation/networking/fore200e.txt create mode 100644 Documentation/networking/framerelay.txt create mode 100644 Documentation/networking/gen_stats.txt create mode 100644 Documentation/networking/generic-hdlc.txt create mode 100644 Documentation/networking/ifenslave.c create mode 100644 Documentation/networking/ip-sysctl.txt create mode 100644 Documentation/networking/ip_dynaddr.txt create mode 100644 Documentation/networking/ipddp.txt create mode 100644 Documentation/networking/iphase.txt create mode 100644 Documentation/networking/irda.txt create mode 100644 Documentation/networking/ixgb.txt create mode 100644 Documentation/networking/lapb-module.txt create mode 100644 Documentation/networking/ltpc.txt create mode 100644 Documentation/networking/multicast.txt create mode 100644 Documentation/networking/ncsa-telnet create mode 100644 Documentation/networking/net-modules.txt create mode 100644 Documentation/networking/netconsole.txt create mode 100644 Documentation/networking/netdevices.txt create mode 100644 Documentation/networking/netif-msg.txt create mode 100644 Documentation/networking/olympic.txt create mode 100644 Documentation/networking/packet_mmap.txt create mode 100644 Documentation/networking/pktgen.txt create mode 100644 Documentation/networking/policy-routing.txt create mode 100644 Documentation/networking/ppp_generic.txt create mode 100644 Documentation/networking/proc_net_tcp.txt create mode 100644 Documentation/networking/pt.txt create mode 100644 Documentation/networking/ray_cs.txt create mode 100644 Documentation/networking/routing.txt create mode 100644 Documentation/networking/s2io.txt create mode 100644 
Documentation/networking/sctp.txt create mode 100644 Documentation/networking/shaper.txt create mode 100644 Documentation/networking/sis900.txt create mode 100644 Documentation/networking/sk98lin.txt create mode 100644 Documentation/networking/skfp.txt create mode 100644 Documentation/networking/slicecom.hun create mode 100644 Documentation/networking/slicecom.txt create mode 100644 Documentation/networking/smc9.txt create mode 100644 Documentation/networking/smctr.txt create mode 100644 Documentation/networking/tcp.txt create mode 100644 Documentation/networking/tlan.txt create mode 100644 Documentation/networking/tms380tr.txt create mode 100644 Documentation/networking/tuntap.txt create mode 100644 Documentation/networking/vortex.txt create mode 100644 Documentation/networking/wan-router.txt create mode 100644 Documentation/networking/wanpipe.txt create mode 100644 Documentation/networking/wavelan.txt create mode 100644 Documentation/networking/x25-iface.txt create mode 100644 Documentation/networking/x25.txt create mode 100644 Documentation/networking/z8530drv.txt create mode 100644 Documentation/nfsroot.txt create mode 100644 Documentation/nmi_watchdog.txt create mode 100644 Documentation/nommu-mmap.txt create mode 100644 Documentation/numastat.txt create mode 100644 Documentation/oops-tracing.txt create mode 100644 Documentation/paride.txt create mode 100644 Documentation/parisc/00-INDEX create mode 100644 Documentation/parisc/debugging create mode 100644 Documentation/parisc/registers create mode 100644 Documentation/parport-lowlevel.txt create mode 100644 Documentation/parport.txt create mode 100644 Documentation/pci.txt create mode 100644 Documentation/pm.txt create mode 100644 Documentation/pnp.txt create mode 100644 Documentation/power/devices.txt create mode 100644 Documentation/power/interface.txt create mode 100644 Documentation/power/kernel_threads.txt create mode 100644 Documentation/power/pci.txt create mode 100644 Documentation/power/states.txt 
create mode 100644 Documentation/power/swsusp.txt create mode 100644 Documentation/power/tricks.txt create mode 100644 Documentation/power/video.txt create mode 100644 Documentation/power/video_extension.txt create mode 100644 Documentation/powerpc/00-INDEX create mode 100644 Documentation/powerpc/SBC8260_memory_mapping.txt create mode 100644 Documentation/powerpc/cpu_features.txt create mode 100644 Documentation/powerpc/eeh-pci-error-recovery.txt create mode 100644 Documentation/powerpc/hvcs.txt create mode 100644 Documentation/powerpc/mpc52xx.txt create mode 100644 Documentation/powerpc/ppc_htab.txt create mode 100644 Documentation/powerpc/smp.txt create mode 100644 Documentation/powerpc/sound.txt create mode 100644 Documentation/powerpc/zImage_layout.txt create mode 100644 Documentation/preempt-locking.txt create mode 100644 Documentation/prio_tree.txt create mode 100644 Documentation/ramdisk.txt create mode 100644 Documentation/riscom8.txt create mode 100644 Documentation/rocket.txt create mode 100644 Documentation/rpc-cache.txt create mode 100644 Documentation/rtc.txt create mode 100644 Documentation/s390/3270.ChangeLog create mode 100644 Documentation/s390/3270.txt create mode 100644 Documentation/s390/CommonIO create mode 100644 Documentation/s390/DASD create mode 100644 Documentation/s390/Debugging390.txt create mode 100644 Documentation/s390/TAPE create mode 100644 Documentation/s390/cds.txt create mode 100644 Documentation/s390/config3270.sh create mode 100644 Documentation/s390/crypto/crypto-API.txt create mode 100644 Documentation/s390/driver-model.txt create mode 100644 Documentation/s390/monreader.txt create mode 100644 Documentation/s390/s390dbf.txt create mode 100644 Documentation/sched-coding.txt create mode 100644 Documentation/sched-design.txt create mode 100644 Documentation/sched-domains.txt create mode 100644 Documentation/sched-stats.txt create mode 100644 Documentation/scsi/00-INDEX create mode 100644 Documentation/scsi/53c700.txt create 
mode 100644 Documentation/scsi/BusLogic.txt create mode 100644 Documentation/scsi/ChangeLog.1992-1997 create mode 100644 Documentation/scsi/ChangeLog.ips create mode 100644 Documentation/scsi/ChangeLog.megaraid create mode 100644 Documentation/scsi/ChangeLog.ncr53c8xx create mode 100644 Documentation/scsi/ChangeLog.sym53c8xx create mode 100644 Documentation/scsi/ChangeLog.sym53c8xx_2 create mode 100644 Documentation/scsi/FlashPoint.txt create mode 100644 Documentation/scsi/LICENSE.FlashPoint create mode 100644 Documentation/scsi/Mylex.txt create mode 100644 Documentation/scsi/NinjaSCSI.txt create mode 100644 Documentation/scsi/aha152x.txt create mode 100644 Documentation/scsi/aic79xx.txt create mode 100644 Documentation/scsi/aic7xxx.txt create mode 100644 Documentation/scsi/aic7xxx_old.txt create mode 100644 Documentation/scsi/cpqfc.txt create mode 100644 Documentation/scsi/dc395x.txt create mode 100644 Documentation/scsi/dpti.txt create mode 100644 Documentation/scsi/dtc3x80.txt create mode 100644 Documentation/scsi/g_NCR5380.txt create mode 100644 Documentation/scsi/ibmmca.txt create mode 100644 Documentation/scsi/in2000.txt create mode 100644 Documentation/scsi/megaraid.txt create mode 100644 Documentation/scsi/ncr53c7xx.txt create mode 100644 Documentation/scsi/ncr53c8xx.txt create mode 100644 Documentation/scsi/osst.txt create mode 100644 Documentation/scsi/ppa.txt create mode 100644 Documentation/scsi/qla2xxx.revision.notes create mode 100644 Documentation/scsi/qlogicfas.txt create mode 100644 Documentation/scsi/qlogicisp.txt create mode 100644 Documentation/scsi/scsi-generic.txt create mode 100644 Documentation/scsi/scsi.txt create mode 100644 Documentation/scsi/scsi_mid_low_api.txt create mode 100644 Documentation/scsi/st.txt create mode 100644 Documentation/scsi/sym53c500_cs.txt create mode 100644 Documentation/scsi/sym53c8xx_2.txt create mode 100644 Documentation/scsi/tmscsim.txt create mode 100644 Documentation/seclvl.txt create mode 100644 
Documentation/serial-console.txt create mode 100644 Documentation/serial/driver create mode 100644 Documentation/sgi-visws.txt create mode 100644 Documentation/sh/kgdb.txt create mode 100644 Documentation/sh/new-machine.txt create mode 100644 Documentation/smart-config.txt create mode 100644 Documentation/smp.txt create mode 100644 Documentation/sonypi.txt create mode 100644 Documentation/sound/alsa/ALSA-Configuration.txt create mode 100644 Documentation/sound/alsa/Audigy-mixer.txt create mode 100644 Documentation/sound/alsa/Bt87x.txt create mode 100644 Documentation/sound/alsa/CMIPCI.txt create mode 100644 Documentation/sound/alsa/ControlNames.txt create mode 100644 Documentation/sound/alsa/DocBook/alsa-driver-api.tmpl create mode 100644 Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl create mode 100644 Documentation/sound/alsa/Joystick.txt create mode 100644 Documentation/sound/alsa/MIXART.txt create mode 100644 Documentation/sound/alsa/OSS-Emulation.txt create mode 100644 Documentation/sound/alsa/Procfile.txt create mode 100644 Documentation/sound/alsa/SB-Live-mixer.txt create mode 100644 Documentation/sound/alsa/VIA82xx-mixer.txt create mode 100644 Documentation/sound/alsa/hda_codec.txt create mode 100644 Documentation/sound/alsa/seq_oss.html create mode 100644 Documentation/sound/alsa/serial-u16550.txt create mode 100644 Documentation/sound/oss/AD1816 create mode 100644 Documentation/sound/oss/ALS create mode 100644 Documentation/sound/oss/AWE32 create mode 100644 Documentation/sound/oss/AudioExcelDSP16 create mode 100644 Documentation/sound/oss/CMI8330 create mode 100644 Documentation/sound/oss/CMI8338 create mode 100644 Documentation/sound/oss/CS4232 create mode 100644 Documentation/sound/oss/ESS create mode 100644 Documentation/sound/oss/ESS1868 create mode 100644 Documentation/sound/oss/INSTALL.awe create mode 100644 Documentation/sound/oss/Introduction create mode 100644 Documentation/sound/oss/MAD16 create mode 100644 
Documentation/sound/oss/Maestro create mode 100644 Documentation/sound/oss/Maestro3 create mode 100644 Documentation/sound/oss/MultiSound create mode 100644 Documentation/sound/oss/NEWS create mode 100644 Documentation/sound/oss/NM256 create mode 100644 Documentation/sound/oss/OPL3 create mode 100644 Documentation/sound/oss/OPL3-SA create mode 100644 Documentation/sound/oss/OPL3-SA2 create mode 100644 Documentation/sound/oss/Opti create mode 100644 Documentation/sound/oss/PAS16 create mode 100644 Documentation/sound/oss/PSS create mode 100644 Documentation/sound/oss/PSS-updates create mode 100644 Documentation/sound/oss/README.OSS create mode 100644 Documentation/sound/oss/README.awe create mode 100644 Documentation/sound/oss/README.modules create mode 100644 Documentation/sound/oss/README.ymfsb create mode 100644 Documentation/sound/oss/SoundPro create mode 100644 Documentation/sound/oss/Soundblaster create mode 100644 Documentation/sound/oss/Tropez+ create mode 100644 Documentation/sound/oss/VIA-chipset create mode 100644 Documentation/sound/oss/VIBRA16 create mode 100644 Documentation/sound/oss/WaveArtist create mode 100644 Documentation/sound/oss/Wavefront create mode 100644 Documentation/sound/oss/btaudio create mode 100644 Documentation/sound/oss/cs46xx create mode 100644 Documentation/sound/oss/es1370 create mode 100644 Documentation/sound/oss/es1371 create mode 100644 Documentation/sound/oss/mwave create mode 100644 Documentation/sound/oss/rme96xx create mode 100644 Documentation/sound/oss/solo1 create mode 100644 Documentation/sound/oss/sonicvibes create mode 100644 Documentation/sound/oss/ultrasound create mode 100644 Documentation/sound/oss/vwsnd create mode 100644 Documentation/sparc/README-2.5 create mode 100644 Documentation/sparc/sbus_drivers.txt create mode 100644 Documentation/sparse.txt create mode 100644 Documentation/specialix.txt create mode 100644 Documentation/spinlocks.txt create mode 100644 Documentation/stable_api_nonsense.txt create mode 
100644 Documentation/stallion.txt create mode 100644 Documentation/svga.txt create mode 100644 Documentation/sx.txt create mode 100644 Documentation/sysctl/README create mode 100644 Documentation/sysctl/abi.txt create mode 100644 Documentation/sysctl/fs.txt create mode 100644 Documentation/sysctl/kernel.txt create mode 100644 Documentation/sysctl/sunrpc.txt create mode 100644 Documentation/sysctl/vm.txt create mode 100644 Documentation/sysrq.txt create mode 100644 Documentation/telephony/ixj.txt create mode 100644 Documentation/time_interpolators.txt create mode 100644 Documentation/tipar.txt create mode 100644 Documentation/tty.txt create mode 100644 Documentation/uml/UserModeLinux-HOWTO.txt create mode 100644 Documentation/unicode.txt create mode 100644 Documentation/usb/CREDITS create mode 100644 Documentation/usb/URB.txt create mode 100644 Documentation/usb/acm.txt create mode 100644 Documentation/usb/auerswald.txt create mode 100644 Documentation/usb/bluetooth.txt create mode 100644 Documentation/usb/dma.txt create mode 100644 Documentation/usb/ehci.txt create mode 100644 Documentation/usb/error-codes.txt create mode 100644 Documentation/usb/gadget_serial.txt create mode 100644 Documentation/usb/hiddev.txt create mode 100644 Documentation/usb/hotplug.txt create mode 100644 Documentation/usb/ibmcam.txt create mode 100644 Documentation/usb/linux.inf create mode 100644 Documentation/usb/mtouchusb.txt create mode 100644 Documentation/usb/ohci.txt create mode 100644 Documentation/usb/ov511.txt create mode 100644 Documentation/usb/proc_usb_info.txt create mode 100644 Documentation/usb/rio.txt create mode 100644 Documentation/usb/se401.txt create mode 100644 Documentation/usb/sn9c102.txt create mode 100644 Documentation/usb/stv680.txt create mode 100644 Documentation/usb/uhci.txt create mode 100644 Documentation/usb/usb-help.txt create mode 100644 Documentation/usb/usb-serial.txt create mode 100644 Documentation/usb/usbmon.txt create mode 100644 
Documentation/usb/w9968cf.txt create mode 100644 Documentation/video4linux/API.html create mode 100644 Documentation/video4linux/CARDLIST.bttv create mode 100644 Documentation/video4linux/CARDLIST.saa7134 create mode 100644 Documentation/video4linux/CARDLIST.tuner create mode 100644 Documentation/video4linux/CQcam.txt create mode 100644 Documentation/video4linux/README.cpia create mode 100644 Documentation/video4linux/README.cx88 create mode 100644 Documentation/video4linux/README.ir create mode 100644 Documentation/video4linux/README.saa7134 create mode 100644 Documentation/video4linux/Zoran create mode 100644 Documentation/video4linux/bttv/CONTRIBUTORS create mode 100644 Documentation/video4linux/bttv/Cards create mode 100644 Documentation/video4linux/bttv/ICs create mode 100644 Documentation/video4linux/bttv/Insmod-options create mode 100644 Documentation/video4linux/bttv/MAKEDEV create mode 100644 Documentation/video4linux/bttv/Modprobe.conf create mode 100644 Documentation/video4linux/bttv/Modules.conf create mode 100644 Documentation/video4linux/bttv/PROBLEMS create mode 100644 Documentation/video4linux/bttv/README create mode 100644 Documentation/video4linux/bttv/README.WINVIEW create mode 100644 Documentation/video4linux/bttv/README.freeze create mode 100644 Documentation/video4linux/bttv/README.quirks create mode 100644 Documentation/video4linux/bttv/Sound-FAQ create mode 100644 Documentation/video4linux/bttv/Specs create mode 100644 Documentation/video4linux/bttv/THANKS create mode 100644 Documentation/video4linux/bttv/Tuners create mode 100644 Documentation/video4linux/meye.txt create mode 100644 Documentation/video4linux/radiotrack.txt create mode 100644 Documentation/video4linux/w9966.txt create mode 100644 Documentation/video4linux/zr36120.txt create mode 100644 Documentation/vm/balance create mode 100644 Documentation/vm/hugetlbpage.txt create mode 100644 Documentation/vm/locking create mode 100644 Documentation/vm/numa create mode 100644 
Documentation/vm/overcommit-accounting create mode 100644 Documentation/voyager.txt create mode 100644 Documentation/w1/w1.generic create mode 100644 Documentation/watchdog/pcwd-watchdog.txt create mode 100644 Documentation/watchdog/watchdog-api.txt create mode 100644 Documentation/watchdog/watchdog.txt create mode 100644 Documentation/x86_64/boot-options.txt create mode 100644 Documentation/x86_64/mm.txt create mode 100644 Documentation/xterm-linux.xpm create mode 100644 Documentation/zorro.txt (limited to 'Documentation') diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX new file mode 100644 index 000000000000..72dc90f8f4a7 --- /dev/null +++ b/Documentation/00-INDEX @@ -0,0 +1,294 @@ + +This is a brief list of all the files in ./linux/Documentation and what +they contain. If you add a documentation file, please list it here in +alphabetical order as well, or risk being hunted down like a rabid dog. +Please try and keep the descriptions small enough to fit on one line. + Thanks -- Paul G. + +Following translations are available on the WWW: + + - Japanese, maintained by the JF Project (JF@linux.or.jp), at + http://www.linux.or.jp/JF/ + +00-INDEX + - this file. +BK-usage/ + - directory with info on BitKeeper. +BUG-HUNTING + - brute force method of doing binary search of patches to find bug. +Changes + - list of changes that break older software packages. +CodingStyle + - how the boss likes the C code in the kernel to look. +DMA-API.txt + - DMA API, pci_ API & extensions for non-consistent memory machines. +DMA-mapping.txt + - info for PCI drivers using DMA portably across all platforms. +DocBook/ + - directory with DocBook templates etc. for kernel documentation. +IO-mapping.txt + - how to access I/O mapped memory from within device drivers. +IPMI.txt + - info on Linux Intelligent Platform Management Interface (IPMI) Driver. +IRQ-affinity.txt + - how to select which CPU(s) handle which interrupt events on SMP. 
+ManagementStyle + - how to (attempt to) manage kernel hackers. +MSI-HOWTO.txt + - the Message Signaled Interrupts (MSI) Driver Guide HOWTO and FAQ. +RCU/ + - directory with info on RCU (read-copy update). +README.DAC960 + - info on Mylex DAC960/DAC1100 PCI RAID Controller Driver for Linux. +SAK.txt + - info on Secure Attention Keys. +SubmittingDrivers + - procedure to get a new driver source included into the kernel tree. +SubmittingPatches + - procedure to get a source patch included into the kernel tree. +VGA-softcursor.txt + - how to change your VGA cursor from a blinking underscore. +arm/ + - directory with info about Linux on the ARM architecture. +basic_profiling.txt + - basic instructions for those who want to profile the Linux kernel. +binfmt_misc.txt + - info on the kernel support for extra binary formats. +block/ + - info on the Block I/O (BIO) layer. +cachetlb.txt + - describes the cache/TLB flushing interfaces Linux uses. +cciss.txt + - info, major/minor #'s for Compaq's SMART Array Controllers. +cdrom/ + - directory with information on the CD-ROM drivers that Linux has. +cli-sti-removal.txt + - cli()/sti() removal guide. +computone.txt + - info on Computone Intelliport II/Plus Multiport Serial Driver. +cpqarray.txt + - info on using Compaq's SMART2 Intelligent Disk Array Controllers. +cpu-freq/ + - info on CPU frequency and voltage scaling. +cris/ + - directory with info about Linux on CRIS architecture. +crypto/ + - directory with info on the Crypto API. +debugging-modules.txt + - some notes on debugging modules after Linux 2.6.3. +device-mapper/ + - directory with info on Device Mapper. +devices.txt + - plain ASCII listing of all the nodes in /dev/ with major minor #'s. +digiepca.txt + - info on Digi Intl. {PC,PCI,EISA}Xx and Xem series cards. +dnotify.txt + - info about directory notification in Linux. +driver-model/ + - directory with info about Linux driver model. +dvb/ + - info on Linux Digital Video Broadcast (DVB) subsystem. 
+early-userspace/ + - info about initramfs, klibc, and userspace early during boot. +eisa.txt + - info on EISA bus support. +exception.txt + - how Linux v2.2 handles exceptions without verify_area etc. +fb/ + - directory with info on the frame buffer graphics abstraction layer. +filesystems/ + - directory with info on the various filesystems that Linux supports. +firmware_class/ + - request_firmware() hotplug interface info. +floppy.txt + - notes and driver options for the floppy disk driver. +ftape.txt + - notes about the floppy tape device driver. +hayes-esp.txt + - info on using the Hayes ESP serial driver. +highuid.txt + - notes on the change from 16 bit to 32 bit user/group IDs. +hpet.txt + - High Precision Event Timer Driver for Linux. +hw_random.txt + - info on Linux support for random number generator in i8xx chipsets. +i2c/ + - directory with info about the I2C bus/protocol (2 wire, kHz speed). +i2o/ + - directory with info about the Linux I2O subsystem. +i386/ + - directory with info about Linux on Intel 32 bit architecture. +ia64/ + - directory with info about Linux on Intel 64 bit architecture. +ide.txt + - important info for users of ATA devices (IDE/EIDE disks and CD-ROMS). +initrd.txt + - how to use the RAM disk as an initial/temporary root filesystem. +input/ + - info on Linux input device support. +io_ordering.txt + - info on ordering I/O writes to memory-mapped addresses. +ioctl-number.txt + - how to implement and register device/driver ioctl calls. +iostats.txt + - info on I/O statistics Linux kernel provides. +isapnp.txt + - info on Linux ISA Plug & Play support. +isdn/ + - directory with info on the Linux ISDN support, and supported cards. +java.txt + - info on the in-kernel binary support for Java(tm). +kbuild/ + - directory with info about the kernel build process. +kernel-doc-nano-HOWTO.txt + - mini HowTo on generation and location of kernel documentation files. 
+kernel-docs.txt + - listing of various WWW + books that document kernel internals. +kernel-parameters.txt + - summary listing of command line / boot prompt args for the kernel. +kobject.txt + - info of the kobject infrastructure of the Linux kernel. +laptop-mode.txt + - How to conserve battery power using laptop-mode. +ldm.txt + - a brief description of LDM (Windows Dynamic Disks). +locks.txt + - info on file locking implementations, flock() vs. fcntl(), etc. +logo.gif + - Full colour GIF image of Linux logo (penguin). +logo.txt + - Info on creator of above logo & site to get additional images from. +m68k/ + - directory with info about Linux on Motorola 68k architecture. +magic-number.txt + - list of magic numbers used to mark/protect kernel data structures. +mandatory.txt + - info on the Linux implementation of Sys V mandatory file locking. +mca.txt + - info on supporting Micro Channel Architecture (e.g. PS/2) systems. +md.txt + - info on boot arguments for the multiple devices driver. +memory.txt + - info on typical Linux memory problems. +mips/ + - directory with info about Linux on MIPS architecture. +mono.txt + - how to execute Mono-based .NET binaries with the help of BINFMT_MISC. +moxa-smartio + - info on installing/using Moxa multiport serial driver. +mtrr.txt + - how to use PPro Memory Type Range Registers to increase performance. +nbd.txt + - info on a TCP implementation of a network block device. +networking/ + - directory with info on various aspects of networking with Linux. +nfsroot.txt + - short guide on setting up a diskless box with NFS root filesystem. +nmi_watchdog.txt + - info on NMI watchdog for SMP systems. +numastat.txt + - info on how to read Numa policy hit/miss statistics in sysfs. +oops-tracing.txt + - how to decode those nasty internal kernel error dump messages. +paride.txt + - information about the parallel port IDE subsystem. +parisc/ + - directory with info on using Linux on PA-RISC architecture. 
+parport.txt + - how to use the parallel-port driver. +parport-lowlevel.txt + - description and usage of the low level parallel port functions. +pci.txt + - info on the PCI subsystem for device driver authors. +pm.txt + - info on Linux power management support. +pnp.txt + - Linux Plug and Play documentation. +power/ + - directory with info on Linux PCI power management. +powerpc/ + - directory with info on using Linux with the PowerPC. +preempt-locking.txt + - info on locking under a preemptive kernel. +ramdisk.txt + - short guide on how to set up and use the RAM disk. +riscom8.txt + - notes on using the RISCom/8 multi-port serial driver. +rocket.txt + - info on the Comtrol RocketPort multiport serial driver. +rpc-cache.txt + - introduction to the caching mechanisms in the sunrpc layer. +rtc.txt + - notes on how to use the Real Time Clock (aka CMOS clock) driver. +s390/ + - directory with info on using Linux on the IBM S390. +sched-coding.txt + - reference for various scheduler-related methods in the O(1) scheduler. +sched-design.txt + - goals, design and implementation of the Linux O(1) scheduler. +sched-domains.txt + - information on scheduling domains. +sched-stats.txt + - information on schedstats (Linux Scheduler Statistics). +scsi/ + - directory with info on Linux scsi support. +serial/ + - directory with info on the low level serial API. +serial-console.txt + - how to set up Linux with a serial line console as the default. +sgi-visws.txt + - short blurb on the SGI Visual Workstations. +sh/ + - directory with info on porting Linux to a new architecture. +smart-config.txt + - description of the Smart Config makefile feature. +smp.txt + - a few notes on symmetric multi-processing. +sonypi.txt + - info on Linux Sony Programmable I/O Device support. +sound/ + - directory with info on sound card support. +sparc/ + - directory with info on using Linux on Sparc architecture. +specialix.txt + - info on hardware/driver for specialix IO8+ multiport serial card. 
+spinlocks.txt + - info on using spinlocks to provide exclusive access in kernel. +stallion.txt + - info on using the Stallion multiport serial driver. +svga.txt + - short guide on selecting video modes at boot via VGA BIOS. +sx.txt + - info on the Specialix SX/SI multiport serial driver. +sysctl/ + - directory with info on the /proc/sys/* files. +sysrq.txt + - info on the magic SysRq key. +telephony/ + - directory with info on telephony (e.g. voice over IP) support. +time_interpolators.txt + - info on time interpolators. +tipar.txt + - information about Parallel link cable for Texas Instruments handhelds. +tty.txt + - guide to the locking policies of the tty layer. +unicode.txt + - info on the Unicode character/font mapping used in Linux. +uml/ + - directory with information about User Mode Linux. +usb/ + - directory with info regarding the Universal Serial Bus. +video4linux/ + - directory with info regarding video/TV/radio cards and linux. +vm/ + - directory with info on the Linux vm code. +voyager.txt + - guide to running Linux on the Voyager architecture. +watchdog/ + - how to auto-reboot Linux if it has "fallen and can't get up". ;-) +x86_64/ + - directory with info on Linux support for AMD x86-64 (Hammer) machines. +xterm-linux.xpm + - XPM image of penguin logo (see logo.txt) sitting on an xterm. +zorro.txt + - info on writing drivers for Zorro bus devices found on Amigas. diff --git a/Documentation/BK-usage/00-INDEX b/Documentation/BK-usage/00-INDEX new file mode 100644 index 000000000000..82768784ea52 --- /dev/null +++ b/Documentation/BK-usage/00-INDEX @@ -0,0 +1,51 @@ +bk-kernel-howto.txt: Description of kernel workflow under BitKeeper + +bk-make-sum: Create summary of changesets in one repository and not +another, typically in preparation to be sent to an upstream maintainer. 
+Typical usage: + cd my-updated-repo + bk-make-sum ~/repo/original-repo + mv /tmp/linus.txt ../original-repo.txt + +bksend: Create readable text output containing summary of changes, GNU +patch of the changes, and BK metadata of changes (as needed for proper +importing into BitKeeper by an upstream maintainer). This output is +suitable for emailing BitKeeper changes. The recipient of this output +may pipe it directly to 'bk receive'. + +bz64wrap: helper script. Uncompressed input is piped to this script, +which compresses its input, and then outputs the uu-/base64-encoded +version of the compressed input. + +cpcset: Copy changeset between unrelated repositories. +Attempts to preserve changeset user, user address, description, in +addition to the changeset (the patch) itself. +Typical usage: + cd my-updated-repo + bk changes # looking for a changeset... + cpcset 1.1511 . ../another-repo + +csets-to-patches: Produces a delta of two BK repositories, in the form +of individual files, each containing a single cset as a GNU patch. +Output is several files, each with the filename "/tmp/rev-$REV.patch" +Typical usage: + cd my-updated-repo + bk changes -L ~/repo/original-repo 2>&1 | \ + perl csets-to-patches + +cset-to-linus: Produces a delta of two BK repositories, in the form of +changeset descriptions, with 'diffstat' output created for each +individual changeset. +Typical usage: + cd my-updated-repo + bk changes -L ~/repo/original-repo 2>&1 | \ + perl cset-to-linus > summary.txt + +gcapatch: Generates patch containing changes in local repository. +Typical usage: + cd my-updated-repo + gcapatch > foo.patch + +unbz64wrap: Reverse an encoded, compressed data stream created by +bz64wrap into an uncompressed, typically text/plain output. 
+ diff --git a/Documentation/BK-usage/bk-kernel-howto.txt b/Documentation/BK-usage/bk-kernel-howto.txt new file mode 100644 index 000000000000..b7b9075d2910 --- /dev/null +++ b/Documentation/BK-usage/bk-kernel-howto.txt @@ -0,0 +1,283 @@ + + Doing the BK Thing, Penguin-Style + + + + +This set of notes is intended mainly for kernel developers, occasional +or full-time, but sysadmins and power users may find parts of it useful +as well. It assumes at least a basic familiarity with CVS, both at a +user level (use on the cmd line) and at a higher level (client-server model). +Due to the author's background, an operation may be described in terms +of CVS, or in terms of how that operation differs from CVS. + +This is -not- intended to be BitKeeper documentation. Always run +"bk help <command>" or in X "bk helptool <command>" for reference +documentation. + + +BitKeeper Concepts +------------------ + +In the true nature of the Internet itself, BitKeeper is a distributed +system. When applied to revision control, this means doing away with +client-server, and changing to a parent-child model... essentially +peer-to-peer. On the developer's end, this also represents a +fundamental disruption in the standard workflow of changes, commits, +and merges. You will need to take a few minutes to think about +how to best work under BitKeeper, and re-optimize things a bit. +In some sense it is a bit radical, because it might be described as +tossing changes out into a maelstrom and having them magically +land at the right destination... but I'm getting ahead of myself. + +Let's start with this progression: +Each BitKeeper source tree on disk is a repository unto itself. +Each repository has a parent (except the root/original, of course). +Each repository contains a set of changesets ("csets"). +Each cset is one or more changed files, bundled together. + +Each tree is a repository, so all changes are checked into the local +tree. 
When a change is checked in, all modified files are grouped +into a logical unit, the changeset. Internally, BK links these +changesets in a tree, representing various converging and diverging +lines of development. These changesets are the bread and butter of +the BK system. + +After the concept of changesets, the next thing you need to get used +to is having multiple copies of source trees lying around. This -really- +takes some getting used to, for some people. Separate source trees +are the means in BitKeeper by which you delineate parallel lines +of development, both minor and major. What would be branches in +CVS become separate source trees, or "clones" in BitKeeper [heh, +or Star Wars] terminology. + +Clones and changesets are the tools from which most of the power of +BitKeeper is derived. As mentioned earlier, each clone has a parent, +the tree used as the source when the new clone was created. In a +CVS-like setup, the parent would be a remote server on the Internet, +and the child is your local clone of that tree. + +Once you have established a common baseline between two source trees -- +a common parent -- then you can merge changesets between those two +trees with ease. Merging changes into a tree is called a "pull", and +is analogous to 'cvs update'. A pull downloads all the changesets in +the remote tree you do not have, and merges them. Sending changes in +one tree to another tree is called a "push". Push sends all changes +in the local tree the remote does not yet have, and merges them. + +From these concepts come some initial command examples: + +1) bk clone -q http://linux.bkbits.net/linux-2.5 linus-2.5 +Download a 2.5 stock kernel tree, naming it "linus-2.5" in the local dir. +The "-q" disables listing every single file as it is downloaded. + +2) bk clone -ql linus-2.5 alpha-2.5 +Create a separate source tree for the Alpha AXP architecture. +The "-l" uses hard links instead of copying data, since both trees are +on the local disk. 
You can also replace the above with "bk lclone -q ..." + +You only clone a tree -once-. After cloning, the tree lives a long time +on disk, being updated by pushes and pulls. + +3) cd alpha-2.5 ; bk pull http://gkernel.bkbits.net/alpha-2.5 +Download changes in "alpha-2.5" repository which are not present +in the local repository, and merge them into the source tree. + +4) bk -r co -q +Because every tree is a repository, files must be checked out before +they will be in their standard places in the source tree. + +5) bk vi fs/inode.c # example change... + bk citool # checkin, using X tool + bk push bk://gkernel@bkbits.net/alpha-2.5 # upload change +Typical example of a BK sequence that would replace the analogous CVS +situation, + vi fs/inode.c + cvs commit + +As this is just supposed to be a quick BK intro, for more in-depth +tutorials, live working demos, and docs, see http://www.bitkeeper.com/ + + + +BK and Kernel Development Workflow +---------------------------------- +Currently the latest 2.5 tree is available via "bk clone $URL" +and "bk pull $URL" at http://linux.bkbits.net/linux-2.5 +This should change in a few weeks to a kernel.org URL. + + +A big part of using BitKeeper is organizing the various trees you have +on your local disk, and organizing the flow of changes among those +trees, and remote trees. If one were to graph the relationships between +a desired BK setup, you are likely to see a few-many-few graph, like +this: + + linux-2.5 + | + merge-to-linus-2.5 + / | | + / | | + vm-hacks bugfixes filesys personal-hacks + \ | | / + \ | | / + \ | | / + testing-and-validation + +Since a "bk push" sends all changes not in the target tree, and +since a "bk pull" receives all changes not in the source tree, you want +to make sure you are only pushing specific changes to the desired tree, +not all changes from "peer parent" trees. 
For example, pushing a change +from the testing-and-validation tree would probably be a bad idea, +because it will push all changes from vm-hacks, bugfixes, filesys, and +personal-hacks trees into the target tree. + +One would typically work on only one "theme" at a time, either +vm-hacks or bugfixes or filesys, keeping those changes isolated in +their own tree during development, and only merge the isolated changes with +other changes when going upstream (to Linus or other maintainers) or +downstream (to your "union" trees, like testing-and-validation above). + +It should be noted that some of this separation is not just recommended +practice, it's actually [for now] -enforced- by BitKeeper. BitKeeper +requires that changesets maintain a certain order, which is the reason +that "bk push" sends all local changesets the remote doesn't have. This +separation may look like a lot of wasted disk space at first, but it +helps when two unrelated changes may "pollute" the same area of code, or +don't follow the same pace of development, or any other of the standard +reasons why one creates a development branch. + +Small development branches (clones) will appear and disappear: + + -------- A --------- B --------- C --------- D ------- + \ / + -----short-term devel branch----- + +While long-term branches will parallel a tree (or trees), with periodic +merge points. 
In this first example, we pull from a tree (pulls, +"\") periodically, such as what occurs when tracking changes in a +vendor tree, never pushing changes back up the line: + + -------- A --------- B --------- C --------- D ------- + \ \ \ + ----long-term devel branch----------------- + +And then a more common case in Linux kernel development, a long term +branch with periodic merges back into the tree (pushes, "/"): + + -------- A --------- B --------- C --------- D ------- + \ \ / \ + ----long-term devel branch----------------- + + + + + +Submitting Changes to Linus +--------------------------- +There's a bit of an art, or style, of submitting changes to Linus. +Since Linus's tree is now (you might say) fully integrated into the +distributed BitKeeper system, there are several prerequisites to +properly submitting a BitKeeper change. All these prereq's are just +general cleanliness of BK usage, so as people become experts at BK, feel +free to optimize this process further (assuming Linus agrees, of +course). + + + +0) Make sure your tree was originally cloned from the linux-2.5 tree +created by Linus. If your tree does not have this as its ancestor, it +is impossible to reliably exchange changesets. + + + +1) Pay attention to your commit text. The commit message that +accompanies each changeset you submit will live on forever in history, +and is used by Linus to accurately summarize the changes in each +pre-patch. Remember that there is no context, so + "fix for new scheduler changes" +would be too vague, but + "fix mips64 arch for new scheduler switch_to(), TIF_xxx semantics" +would be much better. + +You can and should use the command "bk comment -C" to update the +commit text, and improve it after the fact. This is very useful for +development: poor, quick descriptions during development, which get +cleaned up using "bk comment" before issuing the "bk push" to submit the +changes. 
+ + + +2) Include an Internet-available URL for Linus to pull from, such as + + Pull from: http://gkernel.bkbits.net/net-drivers-2.5 + + + +3) Include a summary and "diffstat -p1" of each changeset that will be +downloaded, when Linus issues a "bk pull". The author auto-generates +these summaries using "bk changes -L ", to obtain a listing +of all the pending-to-send changesets, and their commit messages. + +It is important to show Linus what he will be downloading when he issues +a "bk pull", to reduce the time required to sift the changes once they +are downloaded to Linus's local machine. + +IMPORTANT NOTE: One of the features of BK is that your repository does +not have to be up to date, in order for Linus to receive your changes. +It is considered a courtesy to keep your repository fairly recent, to +lessen any potential merge work Linus may need to do. + + +4) Split up your changes. Each maintainer<->Linus situation is likely +to be slightly different here, so take this just as general advice. The +author splits up changes according to "themes" when merging with Linus. +Simultaneous pushes from local development go to special trees which +exist solely to house changes "queued" for Linus. Example of the trees: + + net-drivers-2.5 -- on-going net driver maintenance + vm-2.5 -- VM-related changes + fs-2.5 -- filesystem-related changes + +Linus then has much more freedom for pulling changes. He could (for +example) issue a "bk pull" on vm-2.5 and fs-2.5 trees, to merge their +changes, but hold off net-drivers-2.5 because of a change that needs +more discussion. + +Other maintainers may find that a single linus-pull-from tree is +adequate for passing BK changesets to him. + + + +Frequently Answered Questions +----------------------------- +1) How do I change the e-mail address shown in the changelog? +A. When you run "bk citool" or "bk commit", set environment + variables BK_USER and BK_HOST to the desired username + and host/domain name. 
+ + +2) How do I use tags / get a diff between two kernel versions? +A. Pass the tags Linus uses to 'bk export'. + +ChangeSets are in a forward-progressing order, so it's pretty easy +to get a snapshot starting and ending at any two points in time. +Linus puts tags on each release and pre-release, so you could use +these two examples: + + bk export -tpatch -hdu -rv2.5.4,v2.5.5 | less + # creates patch-2.5.5 essentially + bk export -tpatch -du -rv2.5.5-pre1,v2.5.5 | less + # changes from pre1 to final + +A tag is just an alias for a specific changeset... and since changesets +are ordered, a tag is thus a marker for a specific point in time (or +specific state of the tree). + + +3) Is there an easy way to generate One Big Patch versus mainline, + for my long-lived kernel branch? +A. Yes. This requires BK 3.x, though. + + bk export -tpatch -r`bk repogca bk://linux.bkbits.net/linux-2.5`,+ + diff --git a/Documentation/BK-usage/bk-make-sum b/Documentation/BK-usage/bk-make-sum new file mode 100755 index 000000000000..58ca46a0fcc6 --- /dev/null +++ b/Documentation/BK-usage/bk-make-sum @@ -0,0 +1,34 @@ +#!/bin/sh -e +# DIR=$HOME/BK/axp-2.5 +# cd $DIR + +LINUS_REPO=$1 +DIRBASE=`basename $PWD` + +{ +cat </dev/null + +cat < (:D: :I:)\n$each(:C:){ (:C:)\n}\n}' - + +} > /tmp/linus.txt + +cat < 13/02/2002 +# +# Add diffstat output after Changelog 21/02/2002 + +PROG=bksend + +usage() { + echo "usage: $PROG -r" + echo -e "\twhere is of the form '1.23', '1.23..', '1.23..1.27'," + echo -e "\tor '+' to indicate the most recent revision" + + exit 1 +} + +case $1 in +-r) REV=$2; shift ;; +-r*) REV=`echo $1 | sed 's/^-r//'` ;; +*) echo "$PROG: no revision given, you probably don't want that";; +esac + +[ -z "$REV" ] && usage + +echo "You can import this changeset into BK by piping this whole message to:" +echo "'| bk receive [path to repository]' or apply the patch as usual." 
+ +SEP="\n===================================================================\n\n" +echo -e $SEP +env PAGER=/bin/cat bk changes -r$REV +echo +bk export -tpatch -du -h -r$REV | diffstat +echo; echo +bk export -tpatch -du -h -r$REV +echo -e $SEP +bk send -wgzip_uu -r$REV - diff --git a/Documentation/BK-usage/bz64wrap b/Documentation/BK-usage/bz64wrap new file mode 100755 index 000000000000..be780876849f --- /dev/null +++ b/Documentation/BK-usage/bz64wrap @@ -0,0 +1,41 @@ +#!/bin/sh + +# bz64wrap - the sending side of a bzip2 | base64 stream +# Andreas Dilger Jan 2002 + + +PATH=$PATH:/usr/bin:/usr/local/bin:/usr/freeware/bin + +# A program to generate base64 encoding on stdout +BASE64_ENCODE="uuencode -m /dev/stdout" +BASE64_BEGIN= +BASE64_END= + +BZIP=NO +BASE64=NO + +# Test if we have the bzip program installed +bzip2 -c /dev/null > /dev/null 2>&1 && BZIP=YES + +# Test if uuencode can handle the -m (MIME) encoding option +$BASE64_ENCODE < /dev/null > /dev/null 2>&1 && BASE64=YES + +if [ $BASE64 = NO ]; then + BASE64_ENCODE=mimencode + BASE64_BEGIN="begin-base64 644 -" + BASE64_END="====" + + $BASE64_ENCODE < /dev/null > /dev/null 2>&1 && BASE64=YES +fi + +if [ $BZIP = NO -o $BASE64 = NO ]; then + echo "$0: can't use bz64 encoding: bzip2=$BZIP, $BASE64_ENCODE=$BASE64" + exit 1 +fi + +# Sadly, mimencode does not appear to have good "begin" and "end" markers +# like uuencode does, and it is picky about getting the right start/end of +# the base64 stream, so we handle this internally. +echo "$BASE64_BEGIN" +bzip2 -9 | $BASE64_ENCODE +echo "$BASE64_END" diff --git a/Documentation/BK-usage/cpcset b/Documentation/BK-usage/cpcset new file mode 100755 index 000000000000..b8faca97dab9 --- /dev/null +++ b/Documentation/BK-usage/cpcset @@ -0,0 +1,36 @@ +#!/bin/sh +# +# Purpose: Copy changeset patch and description from one +# repository to another, unrelated one. 
+# +# usage: cpcset [revision] [from-repository] [to-repository] +# + +REV=$1 +FROM=$2 +TO=$3 +TMPF=/tmp/cpcset.$$ + +rm -f $TMPF* + +CWD_SAVE=`pwd` +cd $FROM +bk changes -r$REV | \ + grep -v '^ChangeSet' | \ + sed -e 's/^ //g' > $TMPF.log + +USERHOST=`bk changes -r$REV | grep '^ChangeSet' | awk '{print $4}'` +export BK_USER=`echo $USERHOST | awk '-F@' '{print $1}'` +export BK_HOST=`echo $USERHOST | awk '-F@' '{print $2}'` + +bk export -tpatch -hdu -r$REV > $TMPF.patch && \ +cd $CWD_SAVE && \ +cd $TO && \ +bk import -tpatch -CFR -y"`cat $TMPF.log`" $TMPF.patch . && \ +bk commit -y"`cat $TMPF.log`" + +rm -f $TMPF* + +echo changeset $REV copied. +echo "" + diff --git a/Documentation/BK-usage/cset-to-linus b/Documentation/BK-usage/cset-to-linus new file mode 100755 index 000000000000..d28a96f8c618 --- /dev/null +++ b/Documentation/BK-usage/cset-to-linus @@ -0,0 +1,49 @@ +#!/usr/bin/perl -w + +use strict; + +my ($lhs, $rev, $tmp, $rhs, $s); +my @cset_text = (); +my @pipe_text = (); +my $have_cset = 0; + +while (<>) { + next if /^---/; + + if (($lhs, $tmp, $rhs) = (/^(ChangeSet\@)([^,]+)(, .*)$/)) { + &cset_rev if ($have_cset); + + $rev = $tmp; + $have_cset = 1; + + push(@cset_text, $_); + } + + elsif ($have_cset) { + push(@cset_text, $_); + } +} +&cset_rev if ($have_cset); +exit(0); + + +sub cset_rev { + my $empty_cset = 0; + + open PIPE, "bk export -tpatch -hdu -r $rev | diffstat -p1 2>/dev/null |" or die; + while ($s = <PIPE>) { + $empty_cset = 1 if ($s =~ /0 files changed/); + push(@pipe_text, $s); + } + close(PIPE); + + if (! 
$empty_cset) { + print @cset_text; + print @pipe_text; + print "\n\n"; + } + + @pipe_text = (); + @cset_text = (); +} + diff --git a/Documentation/BK-usage/csets-to-patches b/Documentation/BK-usage/csets-to-patches new file mode 100755 index 000000000000..e2b81c35883f --- /dev/null +++ b/Documentation/BK-usage/csets-to-patches @@ -0,0 +1,44 @@ +#!/usr/bin/perl -w + +use strict; + +my ($lhs, $rev, $tmp, $rhs, $s); +my @cset_text = (); +my @pipe_text = (); +my $have_cset = 0; + +while (<>) { + next if /^---/; + + if (($lhs, $tmp, $rhs) = (/^(ChangeSet\@)([^,]+)(, .*)$/)) { + &cset_rev if ($have_cset); + + $rev = $tmp; + $have_cset = 1; + + push(@cset_text, $_); + } + + elsif ($have_cset) { + push(@cset_text, $_); + } +} +&cset_rev if ($have_cset); +exit(0); + + +sub cset_rev { + my $empty_cset = 0; + + system("bk export -tpatch -du -r $rev > /tmp/rev-$rev.patch"); + + if (! $empty_cset) { + print @cset_text; + print @pipe_text; + print "\n\n"; + } + + @pipe_text = (); + @cset_text = (); +} + diff --git a/Documentation/BK-usage/gcapatch b/Documentation/BK-usage/gcapatch new file mode 100755 index 000000000000..aaeb17dc7c7f --- /dev/null +++ b/Documentation/BK-usage/gcapatch @@ -0,0 +1,8 @@ +#!/bin/sh +# +# Purpose: Generate GNU diff of local changes versus canonical top-of-tree +# +# Usage: gcapatch > foo.patch +# + +bk export -tpatch -hdu -r`bk repogca bk://linux.bkbits.net/linux-2.5`,+ diff --git a/Documentation/BK-usage/unbz64wrap b/Documentation/BK-usage/unbz64wrap new file mode 100755 index 000000000000..4fc3e73e9a81 --- /dev/null +++ b/Documentation/BK-usage/unbz64wrap @@ -0,0 +1,25 @@ +#!/bin/sh + +# unbz64wrap - the receiving side of a bzip2 | base64 stream +# Andreas Dilger Jan 2002 + +# Sadly, mimencode does not appear to have good "begin" and "end" markers +# like uuencode does, and it is picky about getting the right start/end of +# the base64 stream, so we handle this explicitly here. 
+ +PATH=$PATH:/usr/bin:/usr/local/bin:/usr/freeware/bin + +if mimencode -u < /dev/null > /dev/null 2>&1 ; then + SHOW= + while read LINE; do + case $LINE in + begin-base64*) SHOW=YES ;; + ====) SHOW= ;; + *) [ "$SHOW" ] && echo "$LINE" ;; + esac + done | mimencode -u | bunzip2 + exit $? +else + cat - | uudecode -o /dev/stdout | bunzip2 + exit $? +fi diff --git a/Documentation/BUG-HUNTING b/Documentation/BUG-HUNTING new file mode 100644 index 000000000000..ca29242dbc38 --- /dev/null +++ b/Documentation/BUG-HUNTING @@ -0,0 +1,92 @@ +[Sat Mar 2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)] + +This is how to track down a bug if you know nothing about kernel hacking. +It's a brute force approach but it works pretty well. + +You need: + + . A reproducible bug - it has to happen predictably (sorry) + . All the kernel tar files from a revision that worked to the + revision that doesn't + +You will then do: + + . Rebuild a revision that you believe works, install, and verify that. + . Do a binary search over the kernels to figure out which one + introduced the bug. I.e., suppose 1.3.28 didn't have the bug, but + you know that 1.3.69 does. Pick a kernel in the middle and build + that, like 1.3.50. Build & test; if it works, pick the mid point + between .50 and .69, else the mid point between .28 and .50. + . You'll narrow it down to the kernel that introduced the bug. You + can probably do better than this but it gets tricky. + + . Narrow it down to a subdirectory + + - Copy kernel that works into "test". Let's say that 1.3.62 works, + but 1.3.63 doesn't. So you diff -r those two kernels and come + up with a list of directories that changed. For each of those + directories: + + Copy the non-working directory next to the working directory + as "dir.63". + One directory at a time, try moving the working directory to + "dir.62" and moving "dir.63" into its place: + + mv dir dir.62 + mv dir.63 dir + find dir -name '*.[oa]' -print | xargs rm -f + + And then rebuild and retest. 
Assuming that all related + changes were contained in the sub directory, this should + isolate the change to a directory. + + Problems: changes in header files may have occurred; I've + found in my case that they were self explanatory - you may + or may not want to give up when that happens. + + . Narrow it down to a file + + - You can apply the same technique to each file in the directory, + hoping that the changes in that file are self contained. + + . Narrow it down to a routine + + - You can take the old file and the new file and manually create + a merged file that has + + #ifdef VER62 + routine() + { + ... + } + #else + routine() + { + ... + } + #endif + + And then walk through that file, one routine at a time and + prefix it with + + #define VER62 + /* both routines here */ + #undef VER62 + + Then recompile, retest, move the ifdefs until you find the one + that makes the difference. + +Finally, you take all the info that you have, kernel revisions, bug +description, the extent to which you have narrowed it down, and pass +that off to whomever you believe is the maintainer of that section. +A post to linux.dev.kernel isn't such a bad idea if you've done some +work to narrow it down. + +If you get it down to a routine, you'll probably get a fix in 24 hours. + +My apologies to Linus and the other kernel hackers for describing this +brute force approach, it's hardly what a kernel hacker would do. However, +it does work and it lets non-hackers help fix bugs. And it is cool +because Linux snapshots will let you do this - something that you can't +do with vendor supplied releases. 
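The binary search over kernel revisions described above can be sketched in a few lines of C. This is only an illustration of the search, not a tool shipped with the kernel: first_bad_revision(), has_bug_fn and bug_since_63() are hypothetical names, and the predicate stands in for "build that kernel, boot it, and see whether the bug shows up". It assumes the bug stays present in every revision after the one that introduced it.

```c
#include <assert.h>

/* Hypothetical test hook: returns nonzero if revision "rev" shows the bug. */
typedef int (*has_bug_fn)(int rev);

/*
 * Binary search for the first revision that shows the bug.
 * Preconditions: "good" is known to work, "bad" is known to fail,
 * good < bad, and every revision after the first bad one is also bad.
 */
static int first_bad_revision(int good, int bad, has_bug_fn has_bug)
{
	assert(good < bad);

	while (bad - good > 1) {
		int mid = good + (bad - good) / 2;

		if (has_bug(mid))
			bad = mid;	/* bug already present: look earlier */
		else
			good = mid;	/* still works: look later */
	}
	return bad;
}

/* Stand-in predicate for illustration: pretend patch level 63 broke it. */
static int bug_since_63(int rev)
{
	return rev >= 63;
}
```

Calling first_bad_revision(28, 69, bug_since_63) mirrors the 1.3.28 / 1.3.69 case in the text: each pass halves the range, so about six builds are needed instead of forty.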
+ diff --git a/Documentation/Changes b/Documentation/Changes new file mode 100644 index 000000000000..caa6a5529b6b --- /dev/null +++ b/Documentation/Changes @@ -0,0 +1,410 @@ +Intro +===== + +This document is designed to provide a list of the minimum levels of +software necessary to run the 2.6 kernels, as well as provide brief +instructions regarding any other "Gotchas" users may encounter when +trying life on the Bleeding Edge. If upgrading from a pre-2.4.x +kernel, please consult the Changes file included with 2.4.x kernels for +additional information; most of that information will not be repeated +here. Basically, this document assumes that your system is already +functional and running at least 2.4.x kernels. + +This document is originally based on my "Changes" file for 2.0.x kernels +and therefore owes credit to the same people as that file (Jared Mauch, +Axel Boldt, Alessandro Sigala, and countless other users all over the +'net). + +The latest revision of this document, in various formats, can always +be found at . + +Feel free to translate this document. If you do so, please send me a +URL to your translation for inclusion in future revisions of this +document. + +Smotrite file , yavlyaushisya +russkim perevodom dannogo documenta. + +Visite para obtener la traducción +al español de este documento en varios formatos. + +Eine deutsche Version dieser Datei finden Sie unter +. + +Last updated: October 29th, 2002 + +Chris Ricker (kaboom@gatech.edu or chris.ricker@genetics.utah.edu). + +Current Minimal Requirements +============================ + +Upgrade to at *least* these software revisions before thinking you've +encountered a bug! If you're unsure what version you're currently +running, the suggested command should tell you. + +Again, keep in mind that this list assumes you are already +functionally running a Linux 2.4 kernel. 
Also, not all tools are +necessary on all systems; obviously, if you don't have any PCMCIA (PC +Card) hardware, for example, you probably needn't concern yourself +with pcmcia-cs. + +o Gnu C 2.95.3 # gcc --version +o Gnu make 3.79.1 # make --version +o binutils 2.12 # ld -v +o util-linux 2.10o # fdformat --version +o module-init-tools 0.9.10 # depmod -V +o e2fsprogs 1.29 # tune2fs +o jfsutils 1.1.3 # fsck.jfs -V +o reiserfsprogs 3.6.3 # reiserfsck -V 2>&1|grep reiserfsprogs +o xfsprogs 2.6.0 # xfs_db -V +o pcmcia-cs 3.1.21 # cardmgr -V +o quota-tools 3.09 # quota -V +o PPP 2.4.0 # pppd --version +o isdn4k-utils 3.1pre1 # isdnctrl 2>&1|grep version +o nfs-utils 1.0.5 # showmount --version +o procps 3.2.0 # ps --version +o oprofile 0.5.3 # oprofiled --version + +Kernel compilation +================== + +GCC +--- + +The gcc version requirements may vary depending on the type of CPU in your +computer. The next paragraph applies to users of x86 CPUs, but not +necessarily to users of other CPUs. Users of other CPUs should obtain +information about their gcc version requirements from another source. + +The recommended compiler for the kernel is gcc 2.95.x (x >= 3), and it +should be used when you need absolute stability. You may use gcc 3.0.x +instead if you wish, although it may cause problems. Later versions of gcc +have not received much testing for Linux kernel compilation, and there are +almost certainly bugs (mainly, but not exclusively, in the kernel) that +will need to be fixed in order to use these compilers. In any case, using +pgcc instead of plain gcc is just asking for trouble. + +The Red Hat gcc 2.96 compiler subtree can also be used to build this tree. +You should ensure you use gcc-2.96-74 or later. gcc-2.96-54 will not build +the kernel correctly. + +In addition, please pay attention to compiler optimization. Anything +greater than -O2 may not be wise. 
Similarly, if you choose to use gcc-2.95.x +or derivatives, be sure not to use -fstrict-aliasing (which, depending on +your version of gcc 2.95.x, may necessitate using -fno-strict-aliasing). + +Make +---- + +You will need Gnu make 3.79.1 or later to build the kernel. + +Binutils +-------- + +Linux on IA-32 has recently switched from using as86 to using gas for +assembling the 16-bit boot code, removing the need for as86 to compile +your kernel. This change does, however, mean that you need a recent +release of binutils. + +System utilities +================ + +Architectural changes +--------------------- + +DevFS has been obsoleted in favour of udev +(http://www.kernel.org/pub/linux/utils/kernel/hotplug/) + +32-bit UID support is now in place. Have fun! + +Linux documentation for functions is transitioning to inline +documentation via specially-formatted comments near their +definitions in the source. These comments can be combined with the +SGML templates in the Documentation/DocBook directory to make DocBook +files, which can then be converted by DocBook stylesheets to PostScript, +HTML, PDF files, and several other formats. In order to convert from +DocBook format to a format of your choice, you'll need to install Jade as +well as the desired DocBook stylesheets. + +Util-linux +---------- + +New versions of util-linux provide *fdisk support for larger disks, +support new options to mount, recognize more supported partition +types, have a fdformat which works with 2.4 kernels, and similar goodies. +You'll probably want to upgrade. + +Ksymoops +-------- + +If the unthinkable happens and your kernel oopses, you'll need a 2.4 +version of ksymoops to decode the report; see REPORTING-BUGS in the +root of the Linux source for more information. + +Module-Init-Tools +----------------- + +A new module loader is now in the kernel that requires module-init-tools +to use. It is backward compatible with the 2.4.x series kernels. 
+ +Mkinitrd +-------- + +These changes to the /lib/modules file tree layout also require that +mkinitrd be upgraded. + +E2fsprogs +--------- + +The latest version of e2fsprogs fixes several bugs in fsck and +debugfs. Obviously, it's a good idea to upgrade. + +JFSutils +-------- + +The jfsutils package contains the utilities for the file system. +The following utilities are available: +o fsck.jfs - initiate replay of the transaction log, and check + and repair a JFS formatted partition. +o mkfs.jfs - create a JFS formatted partition. +o other file system utilities are also available in this package. + +Reiserfsprogs +------------- + +The reiserfsprogs package should be used for reiserfs-3.6.x +(Linux kernels 2.4.x). It is a combined package and contains working +versions of mkreiserfs, resize_reiserfs, debugreiserfs and +reiserfsck. These utils work on both i386 and alpha platforms. + +Xfsprogs +-------- + +The latest version of xfsprogs contains mkfs.xfs, xfs_db, and the +xfs_repair utilities, among others, for the XFS filesystem. It is +architecture independent and any version from 2.0.0 onward should +work correctly with this version of the XFS kernel code (2.6.0 or +later is recommended, due to some significant improvements). + + +Pcmcia-cs +--------- + +PCMCIA (PC Card) support is now partially implemented in the main +kernel source. Pay attention when you recompile your kernel ;-). +Also, be sure to upgrade to the latest pcmcia-cs release. + +Quota-tools +----------- + +Support for 32 bit uid's and gid's is required if you want to use +the newer version 2 quota format. Quota-tools version 3.07 and +newer has this support. Use the recommended version or newer +from the table above. + +Intel IA32 microcode +-------------------- + +A driver has been added to allow updating of Intel IA32 microcode, +accessible as both a devfs regular file and as a normal (misc) +character device. 
If you are not using devfs you may need to: + +mkdir /dev/cpu +mknod /dev/cpu/microcode c 10 184 +chmod 0644 /dev/cpu/microcode + +as root before you can use this. You'll probably also want to +get the user-space microcode_ctl utility to use with this. + +Powertweak +---------- + +If you are running v0.1.17 or earlier, you should upgrade to +version v0.99.0 or higher. Running old versions may cause problems +with programs using shared memory. + +udev +---- +udev is a userspace application for populating /dev dynamically with +only entries for devices actually present. udev replaces devfs. + +Networking +========== + +General changes +--------------- + +If you have advanced network configuration needs, you should probably +consider using the network tools from ip-route2. + +Packet Filter / NAT +------------------- +The packet filtering and NAT code uses the same tools like the previous 2.4.x +kernel series (iptables). It still includes backwards-compatibility modules +for 2.2.x-style ipchains and 2.0.x-style ipfwadm. + +PPP +--- + +The PPP driver has been restructured to support multilink and to +enable it to operate over diverse media layers. If you use PPP, +upgrade pppd to at least 2.4.0. + +If you are not using devfs, you must have the device file /dev/ppp +which can be made by: + +mknod /dev/ppp c 108 0 + +as root. + +If you use devfsd and build ppp support as modules, you will need +the following in your /etc/devfsd.conf file: + +LOOKUP PPP MODLOAD + +Isdn4k-utils +------------ + +Due to changes in the length of the phone number field, isdn4k-utils +needs to be recompiled or (preferably) upgraded. + +NFS-utils +--------- + +In 2.4 and earlier kernels, the nfs server needed to know about any +client that expected to be able to access files via NFS. This +information would be given to the kernel by "mountd" when the client +mounted the filesystem, or by "exportfs" at system startup. exportfs +would take information about active clients from /var/lib/nfs/rmtab. 
+ +This approach is quite fragile as it depends on rmtab being correct +which is not always easy, particularly when trying to implement +fail-over. Even when the system is working well, rmtab suffers from +getting lots of old entries that never get removed. + +With 2.6 we have the option of having the kernel tell mountd when it +gets a request from an unknown host, and mountd can give appropriate +export information to the kernel. This removes the dependency on +rmtab and means that the kernel only needs to know about currently +active clients. + +To enable this new functionality, you need to: + + mount -t nfsd nfsd /proc/fs/nfs + +before running exportfs or mountd. It is recommended that all NFS +services be protected from the internet-at-large by a firewall where +that is possible. + +Getting updated software +======================== + +Kernel compilation +****************** + +gcc 2.95.3 +---------- +o + +Make +---- +o + +Binutils +-------- +o + +System utilities +**************** + +Util-linux +---------- +o + +Ksymoops +-------- +o + +Module-Init-Tools +----------------- +o + +Mkinitrd +-------- +o + +E2fsprogs +--------- +o + +JFSutils +-------- +o + +Reiserfsprogs +------------- +o + +Xfsprogs +-------- +o + +Pcmcia-cs +--------- +o + +Quota-tools +---------- +o + +Jade +---- +o + +DocBook Stylesheets +------------------- +o + +Intel P6 microcode +------------------ +o + +Powertweak +---------- +o + +udev +---- +o + +Networking +********** + +PPP +--- +o + +Isdn4k-utils +------------ +o + +NFS-utils +--------- +o + +Iptables +-------- +o + +Ip-route2 +--------- +o + +OProfile +-------- +o + +NFS-Utils +--------- +o + diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle new file mode 100644 index 000000000000..f25b3953f513 --- /dev/null +++ b/Documentation/CodingStyle @@ -0,0 +1,431 @@ + + Linux kernel coding style + +This is a short document describing the preferred coding style for the +linux kernel. 
Coding style is very personal, and I won't _force_ my +views on anybody, but this is what goes for anything that I have to be +able to maintain, and I'd prefer it for most other things too. Please +at least consider the points made here. + +First off, I'd suggest printing out a copy of the GNU coding standards, +and NOT read it. Burn them, it's a great symbolic gesture. + +Anyway, here goes: + + + Chapter 1: Indentation + +Tabs are 8 characters, and thus indentations are also 8 characters. +There are heretic movements that try to make indentations 4 (or even 2!) +characters deep, and that is akin to trying to define the value of PI to +be 3. + +Rationale: The whole idea behind indentation is to clearly define where +a block of control starts and ends. Especially when you've been looking +at your screen for 20 straight hours, you'll find it a lot easier to see +how the indentation works if you have large indentations. + +Now, some people will claim that having 8-character indentations makes +the code move too far to the right, and makes it hard to read on a +80-character terminal screen. The answer to that is that if you need +more than 3 levels of indentation, you're screwed anyway, and should fix +your program. + +In short, 8-char indents make things easier to read, and have the added +benefit of warning you when you're nesting your functions too deep. +Heed that warning. + +Don't put multiple statements on a single line unless you have +something to hide: + + if (condition) do_this; + do_something_everytime; + +Outside of comments, documentation and except in Kconfig, spaces are never +used for indentation, and the above example is deliberately broken. + +Get a decent editor and don't leave whitespace at the end of lines. + + + Chapter 2: Breaking long lines and strings + +Coding style is all about readability and maintainability using commonly +available tools. + +The limit on the length of lines is 80 columns and this is a hard limit. 
+ +Statements longer than 80 columns will be broken into sensible chunks. +Descendants are always substantially shorter than the parent and are placed +substantially to the right. The same applies to function headers with a long +argument list. Long strings are as well broken into shorter strings. + +void fun(int a, int b, int c) +{ + if (condition) + printk(KERN_WARNING "Warning this is a long printk with " + "3 parameters a: %u b: %u " + "c: %u \n", a, b, c); + else + next_statement; +} + + Chapter 3: Placing Braces + +The other issue that always comes up in C styling is the placement of +braces. Unlike the indent size, there are few technical reasons to +choose one placement strategy over the other, but the preferred way, as +shown to us by the prophets Kernighan and Ritchie, is to put the opening +brace last on the line, and put the closing brace first, thusly: + + if (x is true) { + we do y + } + +However, there is one special case, namely functions: they have the +opening brace at the beginning of the next line, thus: + + int function(int x) + { + body of function + } + +Heretic people all over the world have claimed that this inconsistency +is ... well ... inconsistent, but all right-thinking people know that +(a) K&R are _right_ and (b) K&R are right. Besides, functions are +special anyway (you can't nest them in C). + +Note that the closing brace is empty on a line of its own, _except_ in +the cases where it is followed by a continuation of the same statement, +ie a "while" in a do-statement or an "else" in an if-statement, like +this: + + do { + body of do-loop + } while (condition); + +and + + if (x == y) { + .. + } else if (x > y) { + ... + } else { + .... + } + +Rationale: K&R. + +Also, note that this brace-placement also minimizes the number of empty +(or almost empty) lines, without any loss of readability. 
Thus, as the +supply of new-lines on your screen is not a renewable resource (think +25-line terminal screens here), you have more empty lines to put +comments on. + + + Chapter 4: Naming + +C is a Spartan language, and so should your naming be. Unlike Modula-2 +and Pascal programmers, C programmers do not use cute names like +ThisVariableIsATemporaryCounter. A C programmer would call that +variable "tmp", which is much easier to write, and not the least more +difficult to understand. + +HOWEVER, while mixed-case names are frowned upon, descriptive names for +global variables are a must. To call a global function "foo" is a +shooting offense. + +GLOBAL variables (to be used only if you _really_ need them) need to +have descriptive names, as do global functions. If you have a function +that counts the number of active users, you should call that +"count_active_users()" or similar, you should _not_ call it "cntusr()". + +Encoding the type of a function into the name (so-called Hungarian +notation) is brain damaged - the compiler knows the types anyway and can +check those, and it only confuses the programmer. No wonder MicroSoft +makes buggy programs. + +LOCAL variable names should be short, and to the point. If you have +some random integer loop counter, it should probably be called "i". +Calling it "loop_counter" is non-productive, if there is no chance of it +being mis-understood. Similarly, "tmp" can be just about any type of +variable that is used to hold a temporary value. + +If you are afraid to mix up your local variable names, you have another +problem, which is called the function-growth-hormone-imbalance syndrome. +See next chapter. + + + Chapter 5: Functions + +Functions should be short and sweet, and do just one thing. They should +fit on one or two screenfuls of text (the ISO/ANSI screen size is 80x24, +as we all know), and do one thing and do that well. 
+ +The maximum length of a function is inversely proportional to the +complexity and indentation level of that function. So, if you have a +conceptually simple function that is just one long (but simple) +case-statement, where you have to do lots of small things for a lot of +different cases, it's OK to have a longer function. + +However, if you have a complex function, and you suspect that a +less-than-gifted first-year high-school student might not even +understand what the function is all about, you should adhere to the +maximum limits all the more closely. Use helper functions with +descriptive names (you can ask the compiler to in-line them if you think +it's performance-critical, and it will probably do a better job of it +than you would have done). + +Another measure of the function is the number of local variables. They +shouldn't exceed 5-10, or you're doing something wrong. Re-think the +function, and split it into smaller pieces. A human brain can +generally easily keep track of about 7 different things, anything more +and it gets confused. You know you're brilliant, but maybe you'd like +to understand what you did 2 weeks from now. + + + Chapter 6: Centralized exiting of functions + +Albeit deprecated by some people, the equivalent of the goto statement is +used frequently by compilers in form of the unconditional jump instruction. + +The goto statement comes in handy when a function exits from multiple +locations and some common work such as cleanup has to be done. + +The rationale is: + +- unconditional statements are easier to understand and follow +- nesting is reduced +- errors by not updating individual exit points when making + modifications are prevented +- saves the compiler work to optimize redundant code away ;) + +int fun(int ) +{ + int result = 0; + char *buffer = kmalloc(SIZE); + + if (buffer == NULL) + return -ENOMEM; + + if (condition1) { + while (loop1) { + ... + } + result = 1; + goto out; + } + ... 
+out: + kfree(buffer); + return result; +} + + Chapter 7: Commenting + +Comments are good, but there is also a danger of over-commenting. NEVER +try to explain HOW your code works in a comment: it's much better to +write the code so that the _working_ is obvious, and it's a waste of +time to explain badly written code. + +Generally, you want your comments to tell WHAT your code does, not HOW. +Also, try to avoid putting comments inside a function body: if the +function is so complex that you need to separately comment parts of it, +you should probably go back to chapter 5 for a while. You can make +small comments to note or warn about something particularly clever (or +ugly), but try to avoid excess. Instead, put the comments at the head +of the function, telling people what it does, and possibly WHY it does +it. + + + Chapter 8: You've made a mess of it + +That's OK, we all do. You've probably been told by your long-time Unix +user helper that "GNU emacs" automatically formats the C sources for +you, and you've noticed that yes, it does do that, but the defaults it +uses are less than desirable (in fact, they are worse than random +typing - an infinite number of monkeys typing into GNU emacs would never +make a good program). + +So, you can either get rid of GNU emacs, or change it to use saner +values. To do the latter, you can stick the following in your .emacs file: + +(defun linux-c-mode () + "C mode with adjusted defaults for use with the Linux kernel." + (interactive) + (c-mode) + (c-set-style "K&R") + (setq tab-width 8) + (setq indent-tabs-mode t) + (setq c-basic-offset 8)) + +This will define the M-x linux-c-mode command. When hacking on a +module, if you put the string -*- linux-c -*- somewhere on the first +two lines, this mode will be automatically invoked. Also, you may want +to add + +(setq auto-mode-alist (cons '("/usr/src/linux.*/.*\\.[ch]$" . 
linux-c-mode) + auto-mode-alist)) + +to your .emacs file if you want to have linux-c-mode switched on +automagically when you edit source files under /usr/src/linux. + +But even if you fail in getting emacs to do sane formatting, not +everything is lost: use "indent". + +Now, again, GNU indent has the same brain-dead settings that GNU emacs +has, which is why you need to give it a few command line options. +However, that's not too bad, because even the makers of GNU indent +recognize the authority of K&R (the GNU people aren't evil, they are +just severely misguided in this matter), so you just give indent the +options "-kr -i8" (stands for "K&R, 8 character indents"), or use +"scripts/Lindent", which indents in the latest style. + +"indent" has a lot of options, and especially when it comes to comment +re-formatting you may want to take a look at the man page. But +remember: "indent" is not a fix for bad programming. + + + Chapter 9: Configuration-files + +For configuration options (arch/xxx/Kconfig, and all the Kconfig files), +somewhat different indentation is used. + +Help text is indented with 2 spaces. + +if CONFIG_EXPERIMENTAL + tristate CONFIG_BOOM + default n + help + Apply nitroglycerine inside the keyboard (DANGEROUS) + bool CONFIG_CHEER + depends on CONFIG_BOOM + default y + help + Output nice messages when you explode +endif + +Generally, CONFIG_EXPERIMENTAL should surround all options not considered +stable. All options that are known to trash data (experimental write- +support for file-systems, for instance) should be denoted (DANGEROUS), other +experimental options should be denoted (EXPERIMENTAL). + + + Chapter 10: Data structures + +Data structures that have visibility outside the single-threaded +environment they are created and destroyed in should always have +reference counts. 
In the kernel, garbage collection doesn't exist (and +outside the kernel garbage collection is slow and inefficient), which +means that you absolutely _have_ to reference count all your uses. + +Reference counting means that you can avoid locking, and allows multiple +users to have access to the data structure in parallel - and not having +to worry about the structure suddenly going away from under them just +because they slept or did something else for a while. + +Note that locking is _not_ a replacement for reference counting. +Locking is used to keep data structures coherent, while reference +counting is a memory management technique. Usually both are needed, and +they are not to be confused with each other. + +Many data structures can indeed have two levels of reference counting, +when there are users of different "classes". The subclass count counts +the number of subclass users, and decrements the global count just once +when the subclass count goes to zero. + +Examples of this kind of "multi-level-reference-counting" can be found in +memory management ("struct mm_struct": mm_users and mm_count), and in +filesystem code ("struct super_block": s_count and s_active). + +Remember: if another thread can find your data structure, and you don't +have a reference count on it, you almost certainly have a bug. + + + Chapter 11: Macros, Enums, Inline functions and RTL + +Names of macros defining constants and labels in enums are capitalized. + +#define CONSTANT 0x12345 + +Enums are preferred when defining several related constants. + +CAPITALIZED macro names are appreciated but macros resembling functions +may be named in lower case. + +Generally, inline functions are preferable to macros resembling functions. 
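One reason for that preference, shown as a small userspace sketch rather than kernel code: a macro pastes its arguments in textually, so an argument with a side effect may be evaluated more than once, while an inline function evaluates each argument exactly once. MAX_MACRO, max_inline and next_value are names made up for this illustration.

```c
/* Function-like macro: "a" and "b" are substituted textually. */
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

/* Inline function: arguments are evaluated once and type-checked. */
static inline int max_inline(int a, int b)
{
	return a > b ? a : b;
}

static int calls;	/* counts how often next_value() has run */

/* An argument expression with a side effect. */
static int next_value(void)
{
	return ++calls;
}

/* Number of times next_value() runs when passed to the macro. */
static int macro_evaluations(void)
{
	calls = 0;
	(void)MAX_MACRO(next_value(), 0);
	return calls;
}

/* Number of times next_value() runs when passed to the inline function. */
static int inline_evaluations(void)
{
	calls = 0;
	(void)max_inline(next_value(), 0);
	return calls;
}
```

With the macro, next_value() runs once for the comparison and once more to produce the result; with the inline function it runs a single time.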
+ +Macros with multiple statements should be enclosed in a do - while block: + +#define macrofun(a, b, c) \ + do { \ + if (a == 5) \ + do_this(b, c); \ + } while (0) + +Things to avoid when using macros: + +1) macros that affect control flow: + +#define FOO(x) \ + do { \ + if (blah(x) < 0) \ + return -EBUGGERED; \ + } while(0) + +is a _very_ bad idea. It looks like a function call but exits the "calling" +function; don't break the internal parsers of those who will read the code. + +2) macros that depend on having a local variable with a magic name: + +#define FOO(val) bar(index, val) + +might look like a good thing, but it's confusing as hell when one reads the +code and it's prone to breakage from seemingly innocent changes. + +3) macros with arguments that are used as l-values: FOO(x) = y; will +bite you if somebody e.g. turns FOO into an inline function. + +4) forgetting about precedence: macros defining constants using expressions +must enclose the expression in parentheses. Beware of similar issues with +macros using parameters. + +#define CONSTANT 0x4000 +#define CONSTEXP (CONSTANT | 3) + +The cpp manual deals with macros exhaustively. The gcc internals manual also +covers RTL which is used frequently with assembly language in the kernel. + + + Chapter 12: Printing kernel messages + +Kernel developers like to be seen as literate. Do mind the spelling +of kernel messages to make a good impression. Do not use crippled +words like "dont" and use "do not" or "don't" instead. + +Kernel messages do not have to be terminated with a period. + +Printing numbers in parentheses (%d) adds no value and should be avoided. + + + Chapter 13: References + +The C Programming Language, Second Edition +by Brian W. Kernighan and Dennis M. Ritchie. +Prentice Hall, Inc., 1988. +ISBN 0-13-110362-8 (paperback), 0-13-110370-9 (hardback). +URL: http://cm.bell-labs.com/cm/cs/cbook/ + +The Practice of Programming +by Brian W. Kernighan and Rob Pike. +Addison-Wesley, Inc., 1999. 
+ISBN 0-201-61586-X. +URL: http://cm.bell-labs.com/cm/cs/tpop/ + +GNU manuals - where in compliance with K&R and this text - for cpp, gcc, +gcc internals and indent, all available from http://www.gnu.org + +WG14 is the international standardization working group for the programming +language C, URL: http://std.dkuug.dk/JTC1/SC22/WG14/ + +-- +Last updated on 16 February 2004 by a community effort on LKML. diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt new file mode 100644 index 000000000000..6ee3cd6134df --- /dev/null +++ b/Documentation/DMA-API.txt @@ -0,0 +1,526 @@ + Dynamic DMA mapping using the generic device + ============================================ + + James E.J. Bottomley + +This document describes the DMA API. For a more gentle introduction +phrased in terms of the pci_ equivalents (and actual examples) see +DMA-mapping.txt + +This API is split into two pieces. Part I describes the API and the +corresponding pci_ API. Part II describes the extensions to the API +for supporting non-consistent memory machines. Unless you know that +your driver absolutely has to support non-consistent platforms (this +is usually only legacy platforms) you should only use the API +described in part I. + +Part I - pci_ and dma_ Equivalent API +------------------------------------- + +To get the pci_ API, you must #include +To get the dma_ API, you must #include + + +Part Ia - Using large dma-coherent buffers +------------------------------------------ + +void * +dma_alloc_coherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, int flag) +void * +pci_alloc_consistent(struct pci_dev *dev, size_t size, + dma_addr_t *dma_handle) + +Consistent memory is memory for which a write by either the device or +the processor can immediately be read by the processor or device +without having to worry about caching effects. + +This routine allocates a region of size bytes of consistent memory.
+It also returns a dma_handle which may be cast to an unsigned +integer the same width as the bus and used as the physical address +base of the region. + +Returns: a pointer to the allocated region (in the processor's virtual +address space) or NULL if the allocation failed. + +Note: consistent memory can be expensive on some platforms, and the +minimum allocation length may be as big as a page, so you should +consolidate your requests for consistent memory as much as possible. +The simplest way to do that is to use the dma_pool calls (see below). + +The flag parameter (dma_alloc_coherent only) allows the caller to +specify the GFP_ flags (see kmalloc) for the allocation (the +implementation may choose to ignore flags that affect the location of +the returned memory, like GFP_DMA). For pci_alloc_consistent, you +must assume GFP_ATOMIC behaviour. + +void +dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t dma_handle) +void +pci_free_consistent(struct pci_dev *dev, size_t size, void *cpu_addr, + dma_addr_t dma_handle) + +Free the region of consistent memory you previously allocated. dev, +size and dma_handle must all be the same as those passed into the +consistent allocate. cpu_addr must be the virtual address returned by +the consistent allocate. + + +Part Ib - Using small dma-coherent buffers +------------------------------------------ + +To get this part of the dma_ API, you must #include + +Many drivers need lots of small dma-coherent memory regions for DMA +descriptors or I/O buffers. Rather than allocating in units of a page +or more using dma_alloc_coherent(), you can use DMA pools. These work +much like a kmem_cache_t, except that they use the dma-coherent allocator +not __get_free_pages(). Also, they understand common hardware constraints +for alignment, like queue heads needing to be aligned on N byte boundaries.
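What "aligned on N byte boundaries" means for an address can be shown with a short userspace sketch. This is not the pool allocator's code; is_power_of_two() and align_up() are hypothetical helpers, and the sketch assumes N is a power of two, as this document requires for the pool "align" argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if x is a power of two; zero does not qualify. */
static int is_power_of_two(size_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Round addr up to the next multiple of align.
 * align must be a power of two, so the low bits can simply be masked off.
 */
static uintptr_t align_up(uintptr_t addr, size_t align)
{
	assert(is_power_of_two(align));
	return (addr + align - 1) & ~((uintptr_t)align - 1);
}
```

For a queue head that must sit on a 32 byte boundary, align_up(0x1001, 32) gives 0x1020, while an address that is already aligned, such as 0x1000, is returned unchanged.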
+ + + struct dma_pool * + dma_pool_create(const char *name, struct device *dev, + size_t size, size_t align, size_t alloc); + + struct pci_pool * + pci_pool_create(const char *name, struct pci_dev *dev, + size_t size, size_t align, size_t alloc); + +The pool create() routines initialize a pool of dma-coherent buffers +for use with a given device. They must be called in a context which +can sleep. + +The "name" is for diagnostics (like a kmem_cache_t name); dev and size +are like what you'd pass to dma_alloc_coherent(). The device's hardware +alignment requirement for this type of data is "align" (which is expressed +in bytes, and must be a power of two). If your device has no boundary +crossing restrictions, pass 0 for alloc; passing 4096 says memory allocated +from this pool must not cross 4KByte boundaries. + + + void *dma_pool_alloc(struct dma_pool *pool, int gfp_flags, + dma_addr_t *dma_handle); + + void *pci_pool_alloc(struct pci_pool *pool, int gfp_flags, + dma_addr_t *dma_handle); + +This allocates memory from the pool; the returned memory will meet the size +and alignment requirements specified at creation time. Pass GFP_ATOMIC to +prevent blocking, or if it's permitted (not in_interrupt, not holding SMP locks) +pass GFP_KERNEL to allow blocking. Like dma_alloc_coherent(), this returns +two values: an address usable by the cpu, and the dma address usable by the +pool's device. + + + void dma_pool_free(struct dma_pool *pool, void *vaddr, + dma_addr_t addr); + + void pci_pool_free(struct pci_pool *pool, void *vaddr, + dma_addr_t addr); + +This puts memory back into the pool. The pool is what was passed to +the pool allocation routine; the cpu and dma addresses are what +were returned when that routine allocated the memory being freed. + + + void dma_pool_destroy(struct dma_pool *pool); + + void pci_pool_destroy(struct pci_pool *pool); + +The pool destroy() routines free the resources of the pool. They must be +called in a context which can sleep.
Make sure you've freed all allocated +memory back to the pool before you destroy it. + + +Part Ic - DMA addressing limitations +------------------------------------ + +int +dma_supported(struct device *dev, u64 mask) +int +pci_dma_supported(struct pci_dev *dev, u64 mask) + +Checks to see if the device can support DMA to the memory described by +mask. + +Returns: 1 if it can and 0 if it can't. + +Notes: This routine merely tests to see if the mask is possible. It +won't change the current mask settings. It is more intended as an +internal API for use by the platform than an external API for use by +driver writers. + +int +dma_set_mask(struct device *dev, u64 mask) +int +pci_set_dma_mask(struct pci_dev *dev, u64 mask) + +Checks to see if the mask is possible and updates the device +parameters if it is. + +Returns: 0 if successful and a negative error if not. + +u64 +dma_get_required_mask(struct device *dev) + +After setting the mask with dma_set_mask(), this API returns the +mask (within that already set) that the platform actually +requires to operate efficiently. Usually this means the returned mask +is the minimum required to cover all of memory. Examining the +required mask gives drivers with variable descriptor sizes the +opportunity to use smaller descriptors as necessary. + +Requesting the required mask does not alter the current mask. If you +wish to take advantage of it, you should issue another dma_set_mask() +call to lower the mask again. + + +Part Id - Streaming DMA mappings +-------------------------------- + +dma_addr_t +dma_map_single(struct device *dev, void *cpu_addr, size_t size, + enum dma_data_direction direction) +dma_addr_t +pci_map_single(struct pci_dev *hwdev, void *cpu_addr, size_t size, + int direction) + +Maps a piece of processor virtual memory so it can be accessed by the +device and returns the physical handle of the memory. + +The direction for both APIs may be converted freely by casting.
+However the dma_ API uses a strongly typed enumerator for its +direction: + +DMA_NONE = PCI_DMA_NONE no direction (used for + debugging) +DMA_TO_DEVICE = PCI_DMA_TODEVICE data is going from the + memory to the device +DMA_FROM_DEVICE = PCI_DMA_FROMDEVICE data is coming from + the device to the + memory +DMA_BIDIRECTIONAL = PCI_DMA_BIDIRECTIONAL direction isn't known + +Notes: Not all memory regions in a machine can be mapped by this +API. Further, regions that appear to be physically contiguous in +kernel virtual space may not be contiguous as physical memory. Since +this API does not provide any scatter/gather capability, it will fail +if the user tries to map a non physically contiguous piece of memory. +For this reason, it is recommended that memory mapped by this API be +obtained only from sources which guarantee to be physically contiguous +(like kmalloc). + +Further, the physical address of the memory must be within the +dma_mask of the device (the dma_mask represents a bit mask of the +addressable region for the device. i.e. if the physical address of +the memory anded with the dma_mask is still equal to the physical +address, then the device can perform DMA to the memory). In order to +ensure that the memory allocated by kmalloc is within the dma_mask, +the driver may specify various platform dependent flags to restrict +the physical memory range of the allocation (e.g. on x86, GFP_DMA +guarantees to be within the first 16Mb of available physical memory, +as required by ISA devices). + +Note also that the above constraints on physical contiguity and +dma_mask may not apply if the platform has an IOMMU (a device which +supplies a physical to virtual mapping between the I/O memory bus and +the device). However, to be portable, device driver writers may *not* +assume that such an IOMMU exists. + +Warnings: Memory coherency operates at a granularity called the cache +line width. 
In order for memory mapped by this API to operate +correctly, the mapped region must begin exactly on a cache line +boundary and end exactly on one (to prevent two separately mapped +regions from sharing a single cache line). Since the cache line size +may not be known at compile time, the API will not enforce this +requirement. Therefore, it is recommended that driver writers who +don't take special care to determine the cache line size at run time +only map virtual regions that begin and end on page boundaries (which +are guaranteed also to be cache line boundaries). + +DMA_TO_DEVICE synchronisation must be done after the last modification +of the memory region by the software and before it is handed off to +the device. Once this primitive is used, memory covered by this +primitive should be treated as read only by the device. If the device +may write to it at any point, it should be DMA_BIDIRECTIONAL (see +below). + +DMA_FROM_DEVICE synchronisation must be done before the driver +accesses data that may be changed by the device. This memory should +be treated as read only by the driver. If the driver needs to write +to it at any point, it should be DMA_BIDIRECTIONAL (see below). + +DMA_BIDIRECTIONAL requires special handling: it means that the driver +isn't sure if the memory was modified before being handed off to the +device and also isn't sure if the device will also modify it. Thus, +you must always sync bidirectional memory twice: once before the +memory is handed off to the device (to make sure all memory changes +are flushed from the processor) and once before the data may be +accessed after being used by the device (to make sure any processor +cache lines are updated with data that the device may have changed).
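The dma_mask and cache line constraints described above can be
sketched in plain userspace C. This is only an illustration: the
helper names buf_within_dma_mask() and buf_cacheline_aligned() are
hypothetical and not part of the DMA API, and the cache line size is
simply passed in as a parameter rather than queried from the platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if both the first and the last byte of
 * [addr, addr + size) satisfy (address & mask) == address. This is
 * sufficient only under the assumption that mask is a contiguous
 * run of low-order bits (e.g. 0x00ffffff).
 */
bool buf_within_dma_mask(uint64_t addr, size_t size, uint64_t mask)
{
	uint64_t last;

	if (size == 0)
		return false;
	last = addr + (size - 1);
	if (last < addr)	/* address arithmetic wrapped around */
		return false;
	return (addr & mask) == addr && (last & mask) == last;
}

/*
 * Hypothetical helper: true if the region begins and ends exactly on
 * a cache line boundary. line must be a power of two.
 */
bool buf_cacheline_aligned(uint64_t addr, size_t size, size_t line)
{
	assert(line != 0 && (line & (line - 1)) == 0);
	return size != 0 &&
	       (addr & (line - 1)) == 0 &&
	       (size & (line - 1)) == 0;
}
```

With a 24-bit mask, for example, a 4096 byte buffer at 0x00fff000
passes the first check, while one byte more would reach 0x01000000
and fail it.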
+ +void +dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size, + enum dma_data_direction direction) +void +pci_unmap_single(struct pci_dev *hwdev, dma_addr_t dma_addr, + size_t size, int direction) + +Unmaps the region previously mapped. All the parameters passed in +must be identical to those passed in (and returned) by the mapping +API. + +dma_addr_t +dma_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction direction) +dma_addr_t +pci_map_page(struct pci_dev *hwdev, struct page *page, + unsigned long offset, size_t size, int direction) +void +dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size, + enum dma_data_direction direction) +void +pci_unmap_page(struct pci_dev *hwdev, dma_addr_t dma_address, + size_t size, int direction) + +API for mapping and unmapping for pages. All the notes and warnings +for the other mapping APIs apply here. Also, although the <offset> +and <size> parameters are provided to do partial page mapping, it is +recommended that you never use these unless you really know what the +cache width is. + +int +dma_mapping_error(dma_addr_t dma_addr) + +int +pci_dma_mapping_error(dma_addr_t dma_addr) + +In some circumstances dma_map_single and dma_map_page will fail to create +a mapping. A driver can check for these errors by testing the returned +dma address with dma_mapping_error(). A non-zero return value means the mapping +could not be created and the driver should take appropriate action (e.g. +reduce current DMA mapping usage or delay and try again later). + +int +dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction direction) +int +pci_map_sg(struct pci_dev *hwdev, struct scatterlist *sg, + int nents, int direction) + +Maps a scatter gather list from the block layer.
+ +Returns: the number of physical segments mapped (this may be shorter +than <nents> passed in if the block layer determines that some +elements of the scatter/gather list are physically adjacent and thus +may be mapped with a single entry). + +Please note that the sg cannot be mapped again if it has been mapped once. +The mapping process is allowed to destroy information in the sg. + +As with the other mapping interfaces, dma_map_sg can fail. When it +does, 0 is returned and a driver must take appropriate action. It is +critical that the driver do something; in the case of a block driver, +aborting the request or even oopsing is better than doing nothing and +corrupting the filesystem. + +void +dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nhwentries, + enum dma_data_direction direction) +void +pci_unmap_sg(struct pci_dev *hwdev, struct scatterlist *sg, + int nents, int direction) + +Unmap the previously mapped scatter/gather list. All the parameters +must be the same as those passed in to the scatter/gather mapping +API. + +Note: <nents> must be the number you passed in, *not* the number of +physical entries returned. + +void +dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size, + enum dma_data_direction direction) +void +pci_dma_sync_single(struct pci_dev *hwdev, dma_addr_t dma_handle, + size_t size, int direction) +void +dma_sync_sg(struct device *dev, struct scatterlist *sg, int nelems, + enum dma_data_direction direction) +void +pci_dma_sync_sg(struct pci_dev *hwdev, struct scatterlist *sg, + int nelems, int direction) + +Synchronise a single contiguous or scatter/gather mapping. All the +parameters must be the same as those passed into the single mapping +API.
+ +Notes: You must do this: + +- Before reading values that have been written by DMA from the device + (use the DMA_FROM_DEVICE direction) +- After writing values that will be written to the device using DMA + (use the DMA_TO_DEVICE direction) +- Before *and* after handing memory to the device if the memory is + DMA_BIDIRECTIONAL + +See also dma_map_single(). + + +Part II - Advanced dma_ usage +----------------------------- + +Warning: These pieces of the DMA API have no PCI equivalent. They +should also not be used in the majority of cases, since they cater for +unlikely corner cases that don't belong in usual drivers. + +If you don't understand how cache line coherency works between a +processor and an I/O device, you should not be using this part of the +API at all. + +void * +dma_alloc_noncoherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, int flag) + +Identical to dma_alloc_coherent() except that the platform will +choose to return either consistent or non-consistent memory as it sees +fit. By using this API, you are guaranteeing to the platform that you +have all the correct and necessary sync points for this memory in the +driver should it choose to return non-consistent memory. + +Note: where the platform can return consistent memory, it will +guarantee that the sync points become nops. + +Warning: Handling non-consistent memory is a real pain. You should +only ever use this API if you positively know your driver will be +required to work on one of the rare (usually non-PCI) architectures +that simply cannot make consistent memory. + +void +dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t dma_handle) + +Free memory allocated by the nonconsistent API. All parameters must +be identical to those passed in (and returned) by +dma_alloc_noncoherent(). + +int +dma_is_consistent(dma_addr_t dma_handle) + +Returns true if the memory pointed to by the dma_handle is actually +consistent.
+ +int +dma_get_cache_alignment(void) + +returns the processor cache alignment. This is the absolute minimum +alignment *and* width that you must observe when either mapping +memory or doing partial flushes. + +Notes: This API may return a number *larger* than the actual cache +line, but it will guarantee that one or more cache lines fit exactly +into the width returned by this call. It will also always be a power +of two for easy alignment + +void +dma_sync_single_range(struct device *dev, dma_addr_t dma_handle, + unsigned long offset, size_t size, + enum dma_data_direction direction) + +does a partial sync. starting at offset and continuing for size. You +must be careful to observe the cache alignment and width when doing +anything like this. You must also be extra careful about accessing +memory you intend to sync partially. + +void +dma_cache_sync(void *vaddr, size_t size, + enum dma_data_direction direction) + +Do a partial sync of memory that was allocated by +dma_alloc_noncoherent(), starting at virtual address vaddr and +continuing on for size. Again, you *must* observe the cache line +boundaries when doing this. + +int +dma_declare_coherent_memory(struct device *dev, dma_addr_t bus_addr, + dma_addr_t device_addr, size_t size, int + flags) + + +Declare region of memory to be handed out by dma_alloc_coherent when +it's asked for coherent memory for this device. + +bus_addr is the physical address to which the memory is currently +assigned in the bus responding region (this will be used by the +platform to perform the mapping) + +device_addr is the physical address the device needs to be programmed +with actually to address this memory (this will be handed out as the +dma_addr_t in dma_alloc_coherent()) + +size is the size of the area (must be multiples of PAGE_SIZE). + +flags can be or'd together and are + +DMA_MEMORY_MAP - request that the memory returned from +dma_alloc_coherent() be directly writeable. 
+ +DMA_MEMORY_IO - request that the memory returned from +dma_alloc_coherent() be addressable using read/write/memcpy_toio etc. + +One or both of these flags must be present. + +DMA_MEMORY_INCLUDES_CHILDREN - make the declared memory be allocated by +dma_alloc_coherent of any child devices of this one (for memory residing +on a bridge). + +DMA_MEMORY_EXCLUSIVE - only allocate memory from the declared regions. +Do not allow dma_alloc_coherent() to fall back to system memory when +it's out of memory in the declared region. + +The return value will be either DMA_MEMORY_MAP or DMA_MEMORY_IO and +must correspond to a passed in flag (i.e. no returning DMA_MEMORY_IO +if only DMA_MEMORY_MAP were passed in) for success or zero for +failure. + +Note, for DMA_MEMORY_IO returns, all subsequent memory returned by +dma_alloc_coherent() may no longer be accessed directly, but instead +must be accessed using the correct bus functions. If your driver +isn't prepared to handle this contingency, it should not specify +DMA_MEMORY_IO in the input flags. + +As a simplification for the platforms, only *one* such region of +memory may be declared per device. + +For reasons of efficiency, most platforms choose to track the declared +region only at the granularity of a page. For smaller allocations, +you should use the dma_pool() API. + +void +dma_release_declared_memory(struct device *dev) + +Remove the memory region previously declared from the system. This +API performs *no* in-use checking for this region and will return +unconditionally having removed all the required structures. It is the +driver's job to ensure that no parts of this memory region are +currently in use. + +void * +dma_mark_declared_memory_occupied(struct device *dev, + dma_addr_t device_addr, size_t size) + +This is used to occupy specific regions of the declared space +(dma_alloc_coherent() will hand out the first free region it finds).
+ +device_addr is the *device* address of the region requested. + +size is the size (and should be a page sized multiple). + +The return value will be either a pointer to the processor virtual +address of the memory, or an error (via PTR_ERR()) if any part of the +region is occupied. + + diff --git a/Documentation/DMA-mapping.txt b/Documentation/DMA-mapping.txt new file mode 100644 index 000000000000..f4ac37f157ea --- /dev/null +++ b/Documentation/DMA-mapping.txt @@ -0,0 +1,881 @@ + Dynamic DMA mapping + =================== + + David S. Miller + Richard Henderson + Jakub Jelinek + +This document describes the DMA mapping system in terms of the pci_ +API. For a similar API that works for generic devices, see +DMA-API.txt. + +Most of the 64bit platforms have special hardware that translates bus +addresses (DMA addresses) into physical addresses. This is similar to +how page tables and/or a TLB translate virtual addresses to physical +addresses on a CPU. This is needed so that e.g. PCI devices can +access with a Single Address Cycle (32bit DMA address) any page in the +64bit physical address space. Previously in Linux those 64bit +platforms had to set artificial limits on the maximum RAM size in the +system, so that the virt_to_bus() static scheme works (the DMA address +translation tables were simply filled on bootup to map each bus +address to the physical page __pa(bus_to_virt())). + +So that Linux can use the dynamic DMA mapping, it needs some help from the +drivers, namely it has to take into account that DMA addresses should be +mapped only for the time they are actually used and unmapped after the DMA +transfer. + +The following API will work of course even on platforms where no such +hardware exists, see e.g. include/asm-i386/pci.h for how it is implemented on +top of the virt_to_bus interface. + +First of all, you should make sure + +#include <linux/pci.h> + +is in your driver.
This file will obtain for you the definition of the +dma_addr_t (which can hold any valid DMA address for the platform) +type which should be used everywhere you hold a DMA (bus) address +returned from the DMA mapping functions. + + What memory is DMA'able? + +The first piece of information you must know is what kernel memory can +be used with the DMA mapping facilities. There has been an unwritten +set of rules regarding this, and this text is an attempt to finally +write them down. + +If you acquired your memory via the page allocator +(i.e. __get_free_page*()) or the generic memory allocators +(i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from +that memory using the addresses returned from those routines. + +This means specifically that you may _not_ use the memory/addresses +returned from vmalloc() for DMA. It is possible to DMA to the +_underlying_ memory mapped into a vmalloc() area, but this requires +walking page tables to get the physical addresses, and then +translating each of those pages back to a kernel address using +something like __va(). [ EDIT: Update this when we integrate +Gerd Knorr's generic code which does this. ] + +This rule also means that you may not use kernel image addresses +(ie. items in the kernel's data/text/bss segment, or your driver's) +nor may you use kernel stack addresses for DMA. Both of these items +might be mapped somewhere entirely different than the rest of physical +memory. + +Also, this means that you cannot take the return of a kmap() +call and DMA to/from that. This is similar to vmalloc(). + +What about block I/O and networking buffers? The block I/O and +networking subsystems make sure that the buffers they use are valid +for you to DMA from/to. + + DMA addressing limitations + +Does your device have any DMA addressing limitations? For example, is +your device only capable of driving the low order 24-bits of address +on the PCI bus for SAC DMA transfers? 
If so, you need to inform the +PCI layer of this fact. + +By default, the kernel assumes that your device can address the full +32-bits in a SAC cycle. For a 64-bit DAC capable device, this needs +to be increased. And for a device with limitations, as discussed in +the previous paragraph, it needs to be decreased. + +pci_alloc_consistent() by default will return 32-bit DMA addresses. +PCI-X specification requires PCI-X devices to support 64-bit +addressing (DAC) for all transactions. And at least one platform (SGI +SN2) requires 64-bit consistent allocations to operate correctly when +the IO bus is in PCI-X mode. Therefore, like with pci_set_dma_mask(), +it's good practice to call pci_set_consistent_dma_mask() to set the +appropriate mask even if your device only supports 32-bit DMA +(default) and especially if it's a PCI-X device. + +For correct operation, you must interrogate the PCI layer in your +device probe routine to see if the PCI controller on the machine can +properly support the DMA addressing limitation your device has. It is +good style to do this even if your device holds the default setting, +because this shows that you did think about these issues wrt. your +device. + +The query is performed via a call to pci_set_dma_mask(): + + int pci_set_dma_mask(struct pci_dev *pdev, u64 device_mask); + +The query for consistent allocations is performed via a call to +pci_set_consistent_dma_mask(): + + int pci_set_consistent_dma_mask(struct pci_dev *pdev, u64 device_mask); + +Here, pdev is a pointer to the PCI device struct of your device, and +device_mask is a bit mask describing which bits of a PCI address your +device supports. It returns zero if your card can perform DMA +properly on the machine given the address mask you provided. + +If it returns non-zero, your device cannot perform DMA properly on +this platform, and attempting to do so will result in undefined +behavior. You must either use a different mask, or not use DMA.
+ +This means that in the failure case, you have three options: + +1) Use another DMA mask, if possible (see below). +2) Use some non-DMA mode for data transfer, if possible. +3) Ignore this device and do not initialize it. + +It is recommended that your driver print a kernel KERN_WARNING message +when you end up performing either #2 or #3. In this manner, if a user +of your driver reports that performance is bad or that the device is not +even detected, you can ask them for the kernel messages to find out +exactly why. + +The standard 32-bit addressing PCI device would do something like +this: + + if (pci_set_dma_mask(pdev, DMA_32BIT_MASK)) { + printk(KERN_WARNING + "mydev: No suitable DMA available.\n"); + goto ignore_this_device; + } + +Another common scenario is a 64-bit capable device. The approach +here is to try for 64-bit DAC addressing, but back down to a +32-bit mask should that fail. The PCI platform code may fail the +64-bit mask not because the platform is not capable of 64-bit +addressing. Rather, it may fail in this case simply because +32-bit SAC addressing is done more efficiently than DAC addressing. +Sparc64 is one platform which behaves in this way. 
+ +Here is how you would handle a 64-bit capable device which can drive +all 64-bits when accessing streaming DMA: + + int using_dac; + + if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) { + using_dac = 1; + } else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) { + using_dac = 0; + } else { + printk(KERN_WARNING + "mydev: No suitable DMA available.\n"); + goto ignore_this_device; + } + +If a card is capable of using 64-bit consistent allocations as well, +the case would look like this: + + int using_dac, consistent_using_dac; + + if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) { + using_dac = 1; + consistent_using_dac = 1; + pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + } else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) { + using_dac = 0; + consistent_using_dac = 0; + pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + } else { + printk(KERN_WARNING + "mydev: No suitable DMA available.\n"); + goto ignore_this_device; + } + +pci_set_consistent_dma_mask() will always be able to set the same or a +smaller mask as pci_set_dma_mask(). However for the rare case that a +device driver only uses consistent allocations, one would have to +check the return value from pci_set_consistent_dma_mask(). + +If your 64-bit device is going to be an enormous consumer of DMA +mappings, this can be problematic since the DMA mappings are a +finite resource on many platforms. Please see the "DAC Addressing +for Address Space Hungry Devices" section near the end of this +document for how to handle this case. + +Finally, if your device can only drive the low 24-bits of +address during PCI bus mastering you might do something like: + + if (pci_set_dma_mask(pdev, 0x00ffffff)) { + printk(KERN_WARNING + "mydev: 24-bit DMA addressing not available.\n"); + goto ignore_this_device; + } + +When pci_set_dma_mask() is successful, and returns zero, the PCI layer +saves away this mask you have provided. The PCI layer will use this +information later when you make DMA mappings. 
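The mask values used in the examples above (DMA_64BIT_MASK,
DMA_32BIT_MASK and 0x00ffffff) each cover a run of low-order address
bits. As a minimal userspace sketch of that relationship, the helper
below builds such a mask from a bit count. The name
addr_bits_to_mask() is hypothetical and is not provided by the PCI
layer described here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return a mask with the low nbits bits set,
 * i.e. the mask for a device that can drive nbits address lines.
 */
uint64_t addr_bits_to_mask(unsigned int nbits)
{
	assert(nbits >= 1 && nbits <= 64);
	/* Shifting a 64-bit value by 64 is undefined in C, so the
	 * full-width case is handled separately. */
	return nbits == 64 ? ~(uint64_t)0 : (((uint64_t)1 << nbits) - 1);
}
```

For a device limited to 24 address bits, addr_bits_to_mask(24) gives
the 0x00ffffff value passed to pci_set_dma_mask() in the last example.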
+ +There is a case which we are aware of at this time, which is worth +mentioning in this documentation. If your device supports multiple +functions (for example a sound card provides playback and record +functions) and the various different functions have _different_ +DMA addressing limitations, you may wish to probe each mask and +only provide the functionality which the machine can handle. It +is important that the last call to pci_set_dma_mask() be for the +most specific mask. + +Here is pseudo-code showing how this might be done: + + #define PLAYBACK_ADDRESS_BITS DMA_32BIT_MASK + #define RECORD_ADDRESS_BITS 0x00ffffff + + struct my_sound_card *card; + struct pci_dev *pdev; + + ... + if (!pci_set_dma_mask(pdev, PLAYBACK_ADDRESS_BITS)) { + card->playback_enabled = 1; + } else { + card->playback_enabled = 0; + printk(KERN_WARNING "%s: Playback disabled due to DMA limitations.\n", + card->name); + } + if (!pci_set_dma_mask(pdev, RECORD_ADDRESS_BITS)) { + card->record_enabled = 1; + } else { + card->record_enabled = 0; + printk(KERN_WARNING "%s: Record disabled due to DMA limitations.\n", + card->name); + } + +A sound card was used as an example here because this genre of PCI +devices seems to be littered with ISA chips given a PCI front end, +and thus retaining the 16MB DMA addressing limitations of ISA. + + Types of DMA mappings + +There are two types of DMA mappings: + +- Consistent DMA mappings which are usually mapped at driver + initialization, unmapped at the end and for which the hardware should + guarantee that the device and the CPU can access the data + in parallel and will see updates made by each other without any + explicit software flushing. + + Think of "consistent" as "synchronous" or "coherent". + + The current default is to return consistent memory in the low 32 + bits of the PCI bus space. However, for future compatibility you + should set the consistent mask even if this default is fine for your + driver.
+ + Good examples of what to use consistent mappings for are: + + - Network card DMA ring descriptors. + - SCSI adapter mailbox command data structures. + - Device firmware microcode executed out of + main memory. + + The invariant these examples all require is that any CPU store + to memory is immediately visible to the device, and vice + versa. Consistent mappings guarantee this. + + IMPORTANT: Consistent DMA memory does not preclude the usage of + proper memory barriers. The CPU may reorder stores to + consistent memory just as it may normal memory. Example: + if it is important for the device to see the first word + of a descriptor updated before the second, you must do + something like: + + desc->word0 = address; + wmb(); + desc->word1 = DESC_VALID; + + in order to get correct behavior on all platforms. + +- Streaming DMA mappings which are usually mapped for one DMA transfer, + unmapped right after it (unless you use pci_dma_sync_* below) and for which + hardware can optimize for sequential accesses. + + Think of "streaming" as "asynchronous" or "outside the coherency + domain". + + Good examples of what to use streaming mappings for are: + + - Networking buffers transmitted/received by a device. + - Filesystem buffers written/read by a SCSI device. + + The interfaces for using this type of mapping were designed in + such a way that an implementation can make whatever performance + optimizations the hardware allows. To this end, when using + such mappings you must be explicit about what you want to happen. + +Neither type of DMA mapping has alignment restrictions that come +from PCI, although some devices may have such restrictions. + + Using Consistent DMA mappings. + +To allocate and map large (PAGE_SIZE or so) consistent DMA regions, +you should do: + + dma_addr_t dma_handle; + + cpu_addr = pci_alloc_consistent(dev, size, &dma_handle); + +where dev is a struct pci_dev *.
You should pass NULL for PCI like buses +where devices don't have struct pci_dev (like ISA, EISA). This may be +called in interrupt context. + +This argument is needed because the DMA translations may be bus +specific (and often is private to the bus which the device is attached +to). + +Size is the length of the region you want to allocate, in bytes. + +This routine will allocate RAM for that region, so it acts similarly to +__get_free_pages (but takes size instead of a page order). If your +driver needs regions sized smaller than a page, you may prefer using +the pci_pool interface, described below. + +The consistent DMA mapping interfaces, for non-NULL dev, will by +default return a DMA address which is SAC (Single Address Cycle) +addressable. Even if the device indicates (via PCI dma mask) that it +may address the upper 32-bits and thus perform DAC cycles, consistent +allocation will only return > 32-bit PCI addresses for DMA if the +consistent dma mask has been explicitly changed via +pci_set_consistent_dma_mask(). This is true of the pci_pool interface +as well. + +pci_alloc_consistent returns two values: the virtual address which you +can use to access it from the CPU and dma_handle which you pass to the +card. + +The cpu return address and the DMA bus master address are both +guaranteed to be aligned to the smallest PAGE_SIZE order which +is greater than or equal to the requested size. This invariant +exists (for example) to guarantee that if you allocate a chunk +which is smaller than or equal to 64 kilobytes, the extent of the +buffer you receive will not cross a 64K boundary. + +To unmap and free such a DMA region, you call: + + pci_free_consistent(dev, size, cpu_addr, dma_handle); + +where dev, size are the same as in the above call and cpu_addr and +dma_handle are the values pci_alloc_consistent returned to you. +This function may not be called in interrupt context. 
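The alignment guarantee described above for pci_alloc_consistent
(alignment to the smallest PAGE_SIZE order that covers the requested
size) can be checked with a little arithmetic. The following
userspace sketch assumes the page size is passed in by the caller;
both helper names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: smallest value of page_size << order that is
 * greater than or equal to size. page_size must be a power of two.
 */
uint64_t consistent_alignment(size_t size, size_t page_size)
{
	uint64_t align = page_size;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	while (align < size)
		align <<= 1;
	return align;
}

/*
 * Hypothetical helper: true if [addr, addr + size) lies inside one
 * naturally aligned window of boundary bytes.
 */
bool stays_within_boundary(uint64_t addr, size_t size, uint64_t boundary)
{
	assert(boundary != 0);
	return size != 0 &&
	       (addr / boundary) == ((addr + size - 1) / boundary);
}
```

With 4096 byte pages, a 40000 byte request is aligned to 65536 bytes,
so a buffer starting on that alignment cannot cross a 64K boundary,
whereas an 8192 byte buffer placed at an unaligned address just below
a 64K boundary would.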
+ +If your driver needs lots of smaller memory regions, you can write +custom code to subdivide pages returned by pci_alloc_consistent, +or you can use the pci_pool API to do that. A pci_pool is like +a kmem_cache, but it uses pci_alloc_consistent not __get_free_pages. +Also, it understands common hardware constraints for alignment, +like queue heads needing to be aligned on N byte boundaries. + +Create a pci_pool like this: + + struct pci_pool *pool; + + pool = pci_pool_create(name, dev, size, align, alloc); + +The "name" is for diagnostics (like a kmem_cache name); dev and size +are as above. The device's hardware alignment requirement for this +type of data is "align" (which is expressed in bytes, and must be a +power of two). If your device has no boundary crossing restrictions, +pass 0 for alloc; passing 4096 says memory allocated from this pool +must not cross 4KByte boundaries (but at that time it may be better to +go for pci_alloc_consistent directly instead). + +Allocate memory from a pci pool like this: + + cpu_addr = pci_pool_alloc(pool, flags, &dma_handle); + +flags are SLAB_KERNEL if blocking is permitted (not in_interrupt nor +holding SMP locks), SLAB_ATOMIC otherwise. Like pci_alloc_consistent, +this returns two values, cpu_addr and dma_handle. + +Free memory that was allocated from a pci_pool like this: + + pci_pool_free(pool, cpu_addr, dma_handle); + +where pool is what you passed to pci_pool_alloc, and cpu_addr and +dma_handle are the values pci_pool_alloc returned. This function +may be called in interrupt context. + +Destroy a pci_pool by calling: + + pci_pool_destroy(pool); + +Make sure you've called pci_pool_free for all memory allocated +from a pool before you destroy the pool. This function may not +be called in interrupt context. 
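To make the "size", "align" and "alloc" parameters of
pci_pool_create more concrete, here is a small userspace calculation.
It is not how pci_pool is implemented; it only shows, under the
assumption that each block is padded up to the alignment, how many
blocks could fit in one window that must not be crossed. Both
function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: round size up to the next multiple of align.
 * align must be a power of two, as required for pool alignment.
 */
size_t pool_block_size(size_t size, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: number of padded blocks that fit in one
 * window of boundary bytes. A result of 0 means a single block is
 * already larger than the window.
 */
size_t pool_blocks_per_window(size_t size, size_t align, size_t boundary)
{
	return boundary / pool_block_size(size, align);
}
```

For example, 24 byte descriptors with 16 byte alignment are padded to
32 bytes, so 128 of them fit in a 4096 byte window without any block
crossing a 4KByte boundary.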
+ + DMA Direction + +The interfaces described in subsequent portions of this document +take a DMA direction argument, which is an integer and takes on +one of the following values: + + PCI_DMA_BIDIRECTIONAL + PCI_DMA_TODEVICE + PCI_DMA_FROMDEVICE + PCI_DMA_NONE + +One should provide the exact DMA direction if you know it. + +PCI_DMA_TODEVICE means "from main memory to the PCI device" +PCI_DMA_FROMDEVICE means "from the PCI device to main memory" +It is the direction in which the data moves during the DMA +transfer. + +You are _strongly_ encouraged to specify this as precisely +as you possibly can. + +If you absolutely cannot know the direction of the DMA transfer, +specify PCI_DMA_BIDIRECTIONAL. It means that the DMA can go in +either direction. The platform guarantees that you may legally +specify this, and that it will work, but this may be at the +cost of performance for example. + +The value PCI_DMA_NONE is to be used for debugging. One can +hold this in a data structure before you come to know the +precise direction, and this will help catch cases where your +direction tracking logic has failed to set things up properly. + +Another advantage of specifying this value precisely (outside of +potential platform-specific optimizations of such) is for debugging. +Some platforms actually have a write permission boolean which DMA +mappings can be marked with, much like page protections in the user +program address space. Such platforms can and do report errors in the +kernel logs when the PCI controller hardware detects violation of the +permission setting. + +Only streaming mappings specify a direction, consistent mappings +implicitly have a direction attribute setting of +PCI_DMA_BIDIRECTIONAL. + +The SCSI subsystem provides mechanisms for you to easily obtain +the direction to use, in the SCSI command: + + scsi_to_pci_dma_dir(SCSI_DIRECTION) + +Where SCSI_DIRECTION is obtained from the 'sc_data_direction' +member of the SCSI command your driver is working on. 
The +mentioned interface above returns a value suitable for passing +into the streaming DMA mapping interfaces below. + +For Networking drivers, it's a rather simple affair. For transmit +packets, map/unmap them with the PCI_DMA_TODEVICE direction +specifier. For receive packets, just the opposite, map/unmap them +with the PCI_DMA_FROMDEVICE direction specifier. + + Using Streaming DMA mappings + +The streaming DMA mapping routines can be called from interrupt +context. There are two versions of each map/unmap, one which will +map/unmap a single memory region, and one which will map/unmap a +scatterlist. + +To map a single region, you do: + + struct pci_dev *pdev = mydev->pdev; + dma_addr_t dma_handle; + void *addr = buffer->ptr; + size_t size = buffer->len; + + dma_handle = pci_map_single(dev, addr, size, direction); + +and to unmap it: + + pci_unmap_single(dev, dma_handle, size, direction); + +You should call pci_unmap_single when the DMA activity is finished, e.g. +from the interrupt which told you that the DMA transfer is done. + +Using cpu pointers like this for single mappings has a disadvantage, +you cannot reference HIGHMEM memory in this way. Thus, there is a +map/unmap interface pair akin to pci_{map,unmap}_single. These +interfaces deal with page/offset pairs instead of cpu pointers. +Specifically: + + struct pci_dev *pdev = mydev->pdev; + dma_addr_t dma_handle; + struct page *page = buffer->page; + unsigned long offset = buffer->offset; + size_t size = buffer->len; + + dma_handle = pci_map_page(dev, page, offset, size, direction); + + ... + + pci_unmap_page(dev, dma_handle, size, direction); + +Here, "offset" means byte offset within the given page. 
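To make the page/offset idea concrete, here is a minimal user-space sketch of how a linear byte range breaks down into a (page, offset, length) piece. It only shows the arithmetic and does not touch struct page or any kernel interface. The 4096-byte SKETCH_PAGE_SIZE, the sketch_chunk structure, and sketch_first_chunk are all assumptions introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this illustration only. */
#define SKETCH_PAGE_SIZE 4096UL

struct sketch_chunk {
	uint64_t page_index;	/* which page the chunk lives in */
	unsigned long offset;	/* byte offset within that page */
	size_t len;		/* bytes of the range inside that page */
};

/*
 * Describe the part of [addr, addr + remaining) that lies in the
 * page containing addr. A caller would advance addr by the returned
 * len and repeat until remaining reaches zero.
 */
static struct sketch_chunk sketch_first_chunk(uint64_t addr,
					      size_t remaining)
{
	struct sketch_chunk c;
	size_t room;

	c.page_index = addr / SKETCH_PAGE_SIZE;
	c.offset = (unsigned long)(addr % SKETCH_PAGE_SIZE);
	room = SKETCH_PAGE_SIZE - c.offset;
	c.len = remaining < room ? remaining : room;
	return c;
}
```

For instance, a 100-byte range starting at 0x1ff0 begins in page 1 at offset 4080, and only its first 16 bytes fall inside that page.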
+ +With scatterlists, you map a region gathered from several regions by: + + int i, count = pci_map_sg(dev, sglist, nents, direction); + struct scatterlist *sg; + + for (i = 0, sg = sglist; i < count; i++, sg++) { + hw_address[i] = sg_dma_address(sg); + hw_len[i] = sg_dma_len(sg); + } + +where nents is the number of entries in the sglist. + +The implementation is free to merge several consecutive sglist entries +into one (e.g. if DMA mapping is done with PAGE_SIZE granularity, any +consecutive sglist entries can be merged into one provided the first one +ends and the second one starts on a page boundary - in fact this is a huge +advantage for cards which either cannot do scatter-gather or have very +limited number of scatter-gather entries) and returns the actual number +of sg entries it mapped them to. On failure 0 is returned. + +Then you should loop count times (note: this can be less than nents times) +and use sg_dma_address() and sg_dma_len() macros where you previously +accessed sg->address and sg->length as shown above. + +To unmap a scatterlist, just call: + + pci_unmap_sg(dev, sglist, nents, direction); + +Again, make sure DMA activity has already finished. + +PLEASE NOTE: The 'nents' argument to the pci_unmap_sg call must be + the _same_ one you passed into the pci_map_sg call, + it should _NOT_ be the 'count' value _returned_ from the + pci_map_sg call. + +Every pci_map_{single,sg} call should have its pci_unmap_{single,sg} +counterpart, because the bus address space is a shared resource (although +in some ports the mapping is per each BUS so less devices contend for the +same bus address space) and you could render the machine unusable by eating +all bus addresses. + +If you need to use the same streaming DMA region multiple times and touch +the data in between the DMA transfers, the buffer needs to be synced +properly in order for the cpu and device to see the most uptodate and +correct copy of the DMA buffer. 
+ +So, firstly, just map it with pci_map_{single,sg}, and after each DMA +transfer call either: + + pci_dma_sync_single_for_cpu(dev, dma_handle, size, direction); + +or: + + pci_dma_sync_sg_for_cpu(dev, sglist, nents, direction); + +as appropriate. + +Then, if you wish to let the device get at the DMA area again, +finish accessing the data with the cpu, and then before actually +giving the buffer to the hardware call either: + + pci_dma_sync_single_for_device(dev, dma_handle, size, direction); + +or: + + pci_dma_sync_sg_for_device(dev, sglist, nents, direction); + +as appropriate. + +After the last DMA transfer call one of the DMA unmap routines +pci_unmap_{single,sg}. If you don't touch the data from the first pci_map_* +call till pci_unmap_*, then you don't have to call the pci_dma_sync_* +routines at all. + +Here is pseudo code which shows a situation in which you would need +to use the pci_dma_sync_*() interfaces. + + my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len) + { + dma_addr_t mapping; + + mapping = pci_map_single(cp->pdev, buffer, len, PCI_DMA_FROMDEVICE); + + cp->rx_buf = buffer; + cp->rx_len = len; + cp->rx_dma = mapping; + + give_rx_buf_to_card(cp); + } + + ... + + my_card_interrupt_handler(int irq, void *devid, struct pt_regs *regs) + { + struct my_card *cp = devid; + + ... + if (read_card_status(cp) == RX_BUF_TRANSFERRED) { + struct my_card_header *hp; + + /* Examine the header to see if we wish + * to accept the data. But synchronize + * the DMA transfer with the CPU first + * so that we see updated contents. + */ + pci_dma_sync_single_for_cpu(cp->pdev, cp->rx_dma, + cp->rx_len, + PCI_DMA_FROMDEVICE); + + /* Now it is safe to examine the buffer. 
*/ + hp = (struct my_card_header *) cp->rx_buf; + if (header_is_ok(hp)) { + pci_unmap_single(cp->pdev, cp->rx_dma, cp->rx_len, + PCI_DMA_FROMDEVICE); + pass_to_upper_layers(cp->rx_buf); + make_and_setup_new_rx_buf(cp); + } else { + /* Just sync the buffer and give it back + * to the card. + */ + pci_dma_sync_single_for_device(cp->pdev, + cp->rx_dma, + cp->rx_len, + PCI_DMA_FROMDEVICE); + give_rx_buf_to_card(cp); + } + } + } + +Drivers converted fully to this interface should not use virt_to_bus any +longer, nor should they use bus_to_virt. Some drivers have to be changed a +little bit, because there is no longer an equivalent to bus_to_virt in the +dynamic DMA mapping scheme - you have to always store the DMA addresses +returned by the pci_alloc_consistent, pci_pool_alloc, and pci_map_single +calls (pci_map_sg stores them in the scatterlist itself if the platform +supports dynamic DMA mapping in hardware) in your driver structures and/or +in the card registers. + +All PCI drivers should be using these interfaces with no exceptions. +It is planned to completely remove virt_to_bus() and bus_to_virt() as +they are entirely deprecated. Some ports already do not provide these +as it is impossible to correctly support them. + + 64-bit DMA and DAC cycle support + +Do you understand all of the text above? Great, then you already +know how to use 64-bit DMA addressing under Linux. Simply make +the appropriate pci_set_dma_mask() calls based upon your card's +capabilities, then use the mapping APIs above. + +It is that simple. + +Well, not for some odd devices. See the next section for information +about that. + + DAC Addressing for Address Space Hungry Devices + +There exists a class of devices which do not mesh well with the PCI +DMA mapping API. By definition these "mappings" are a finite +resource. The number of total available mappings per bus is platform +specific, but there will always be a reasonable amount. + +What is "reasonable"?
Reasonable means that networking and block I/O +devices need not worry about using too many mappings. + +As an example of a problematic device, consider compute cluster cards. +They can potentially need to access gigabytes of memory at once via +DMA. Dynamic mappings are unsuitable for this kind of access pattern. + +To this end we've provided a small API by which a device driver +may use DAC cycles to directly address all of physical memory. +Not all platforms support this, but most do. It is easy to determine +whether the platform will work properly at probe time. + +First, understand that there may be a SEVERE performance penalty for +using these interfaces on some platforms. Therefore, you MUST only +use these interfaces if it is absolutely required. 99% of devices can +use the normal APIs without any problems. + +Note that for streaming type mappings you must either use these +interfaces, or the dynamic mapping interfaces above. You may not mix +usage of both for the same device. Such an act is illegal and is +guaranteed to put a banana in your tailpipe. + +However, consistent mappings may in fact be used in conjunction with +these interfaces. Remember that, as defined, consistent mappings are +always going to be SAC addressable. + +The first thing your driver needs to do is query the PCI platform +layer with your device's DAC addressing capabilities: + + int pci_dac_set_dma_mask(struct pci_dev *pdev, u64 mask); + +This routine behaves identically to pci_set_dma_mask. You may not +use the following interfaces if this routine fails. + +Next, DMA addresses using this API are kept track of using the +dma64_addr_t type. It is guaranteed to be big enough to hold any +DAC address the platform layer will give to you from the following +routines. If you have consistent mappings as well, you still +use plain dma_addr_t to keep track of those. + +All mappings obtained here will be direct.
The mappings are not +translated, and this is the purpose of this dialect of the DMA API. + +All routines work with page/offset pairs. This is the _ONLY_ way to +portably refer to any piece of memory. If you have a cpu pointer +(which may be validly DMA'd too) you may easily obtain the page +and offset using something like this: + + struct page *page = virt_to_page(ptr); + unsigned long offset = offset_in_page(ptr); + +Here are the interfaces: + + dma64_addr_t pci_dac_page_to_dma(struct pci_dev *pdev, + struct page *page, + unsigned long offset, + int direction); + +The DAC address for the tuple PAGE/OFFSET are returned. The direction +argument is the same as for pci_{map,unmap}_single(). The same rules +for cpu/device access apply here as for the streaming mapping +interfaces. To reiterate: + + The cpu may touch the buffer before pci_dac_page_to_dma. + The device may touch the buffer after pci_dac_page_to_dma + is made, but the cpu may NOT. + +When the DMA transfer is complete, invoke: + + void pci_dac_dma_sync_single_for_cpu(struct pci_dev *pdev, + dma64_addr_t dma_addr, + size_t len, int direction); + +This must be done before the CPU looks at the buffer again. +This interface behaves identically to pci_dma_sync_{single,sg}_for_cpu(). + +And likewise, if you wish to let the device get back at the buffer after +the cpu has read/written it, invoke: + + void pci_dac_dma_sync_single_for_device(struct pci_dev *pdev, + dma64_addr_t dma_addr, + size_t len, int direction); + +before letting the device access the DMA area again. + +If you need to get back to the PAGE/OFFSET tuple from a dma64_addr_t +the following interfaces are provided: + + struct page *pci_dac_dma_to_page(struct pci_dev *pdev, + dma64_addr_t dma_addr); + unsigned long pci_dac_dma_to_offset(struct pci_dev *pdev, + dma64_addr_t dma_addr); + +This is possible with the DAC interfaces purely because they are +not translated in any way. 
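The reason a DAC address can be turned back into a PAGE/OFFSET tuple is easiest to see in a toy model. The user-space sketch below assumes a direct mapping of the form "fixed bus offset plus physical address"; the page shift of 12, the SKETCH_DAC_BASE value, and all sketch_* names are hypothetical and are not taken from any platform's implementation of pci_dac_page_to_dma, pci_dac_dma_to_page, or pci_dac_dma_to_offset.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for this illustration only. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_DAC_BASE   0x0000010000000000ULL	/* hypothetical bus offset */

/* Forward direction: page index and in-page offset to a 64-bit address. */
static uint64_t sketch_page_to_dma(uint64_t page_index, unsigned long offset)
{
	assert(offset < SKETCH_PAGE_SIZE);
	return SKETCH_DAC_BASE + (page_index << SKETCH_PAGE_SHIFT) + offset;
}

/* Reverse direction: recover the page index from the address. */
static uint64_t sketch_dma_to_page(uint64_t dma_addr)
{
	return (dma_addr - SKETCH_DAC_BASE) >> SKETCH_PAGE_SHIFT;
}

/* Reverse direction: recover the in-page offset from the address. */
static unsigned long sketch_dma_to_offset(uint64_t dma_addr)
{
	return (unsigned long)((dma_addr - SKETCH_DAC_BASE) &
			       (SKETCH_PAGE_SIZE - 1));
}
```

Because the forward function is plain arithmetic with no lookup table behind it, the reverse functions need no saved state. A dynamically translated mapping offers no such inverse, which is why the streaming interfaces have no counterpart to these two routines.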
+ + Optimizing Unmap State Space Consumption + +On many platforms, pci_unmap_{single,page}() is simply a nop. +Therefore, keeping track of the mapping address and length is a waste +of space. Instead of filling your drivers up with ifdefs and the like +to "work around" this (which would defeat the whole purpose of a +portable API) the following facilities are provided. + +Actually, instead of describing the macros one by one, we'll +transform some example code. + +1) Use DECLARE_PCI_UNMAP_{ADDR,LEN} in state saving structures. + Example, before: + + struct ring_state { + struct sk_buff *skb; + dma_addr_t mapping; + __u32 len; + }; + + after: + + struct ring_state { + struct sk_buff *skb; + DECLARE_PCI_UNMAP_ADDR(mapping) + DECLARE_PCI_UNMAP_LEN(len) + }; + + NOTE: DO NOT put a semicolon at the end of the DECLARE_*() + macro. + +2) Use pci_unmap_{addr,len}_set to set these values. + Example, before: + + ringp->mapping = FOO; + ringp->len = BAR; + + after: + + pci_unmap_addr_set(ringp, mapping, FOO); + pci_unmap_len_set(ringp, len, BAR); + +3) Use pci_unmap_{addr,len} to access these values. + Example, before: + + pci_unmap_single(pdev, ringp->mapping, ringp->len, + PCI_DMA_FROMDEVICE); + + after: + + pci_unmap_single(pdev, + pci_unmap_addr(ringp, mapping), + pci_unmap_len(ringp, len), + PCI_DMA_FROMDEVICE); + +It really should be self-explanatory. We treat the ADDR and LEN +separately, because it is possible for an implementation to only +need the address in order to perform the unmap operation. + + Platform Issues + +If you are just writing drivers for Linux and do not maintain +an architecture port for the kernel, you can safely skip down +to "Closing". + +1) Struct scatterlist requirements. + + Struct scatterlist must contain, at a minimum, the following + members: + + struct page *page; + unsigned int offset; + unsigned int length; + + The base address is specified by a "page+offset" pair. 
+ + Previous versions of struct scatterlist contained a "void *address" + field that was sometimes used instead of page+offset. As of Linux + 2.5, page+offset is always used, and the "address" field has been + deleted. + +2) More to come... + + Handling Errors + +DMA address space is limited on some architectures and an allocation +failure can be determined by: + +- checking if pci_alloc_consistent returns NULL or pci_map_sg returns 0 + +- checking the returned dma_addr_t of pci_map_single and pci_map_page + by using pci_dma_mapping_error(): + + dma_addr_t dma_handle; + + dma_handle = pci_map_single(dev, addr, size, direction); + if (pci_dma_mapping_error(dma_handle)) { + /* + * reduce current DMA mapping usage, + * delay and try again later or + * reset driver. + */ + } + + Closing + +This document, and the API itself, would not be in its current +form without the feedback and suggestions from numerous individuals. +We would like to specifically mention, in no particular order, the +following people: + + Russell King + Leo Dagum + Ralf Baechle + Grant Grundler + Jay Estabrook + Thomas Sailer + Andrea Arcangeli + Jens Axboe + David Mosberger-Tang diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile new file mode 100644 index 000000000000..a221039ee4c9 --- /dev/null +++ b/Documentation/DocBook/Makefile @@ -0,0 +1,195 @@ +### +# This makefile is used to generate the kernel documentation, +# primarily based on in-line comments in various source files. +# See Documentation/kernel-doc-nano-HOWTO.txt for instruction in how +# to document the SRC - and how to read it. +# To add a new book the only step required is to add the book to the +# list of DOCBOOKS.
+ +DOCBOOKS := wanbook.xml z8530book.xml mcabook.xml videobook.xml \ + kernel-hacking.xml kernel-locking.xml via-audio.xml \ + deviceiobook.xml procfs-guide.xml tulip-user.xml \ + writing_usb_driver.xml scsidrivers.xml sis900.xml \ + kernel-api.xml journal-api.xml lsm.xml usb.xml \ + gadget.xml libata.xml mtdnand.xml librs.xml + +### +# The build process is as follows (targets): +# (xmldocs) +# file.tmpl --> file.xml +--> file.ps (psdocs) +# +--> file.pdf (pdfdocs) +# +--> DIR=file (htmldocs) +# +--> man/ (mandocs) + +### +# The targets that may be used. +.PHONY: xmldocs sgmldocs psdocs pdfdocs htmldocs mandocs installmandocs + +BOOKS := $(addprefix $(obj)/,$(DOCBOOKS)) +xmldocs: $(BOOKS) +sgmldocs: xmldocs + +PS := $(patsubst %.xml, %.ps, $(BOOKS)) +psdocs: $(PS) + +PDF := $(patsubst %.xml, %.pdf, $(BOOKS)) +pdfdocs: $(PDF) + +HTML := $(patsubst %.xml, %.html, $(BOOKS)) +htmldocs: $(HTML) + +MAN := $(patsubst %.xml, %.9, $(BOOKS)) +mandocs: $(MAN) + +installmandocs: mandocs + $(MAKEMAN) install Documentation/DocBook/man + +### +#External programs used +KERNELDOC = scripts/kernel-doc +DOCPROC = scripts/basic/docproc +SPLITMAN = $(PERL) $(srctree)/scripts/split-man +MAKEMAN = $(PERL) $(srctree)/scripts/makeman + +### +# DOCPROC is used for two purposes: +# 1) To generate a dependency list for a .tmpl file +# 2) To preprocess a .tmpl file and call kernel-doc with +# appropriate parameters. +# The following rules are used to generate the .xml documentation +# required to generate the final targets. (ps, pdf, html). 
+quiet_cmd_docproc = DOCPROC $@ + cmd_docproc = SRCTREE=$(srctree)/ $(DOCPROC) doc $< >$@ +define rule_docproc + set -e; \ + $(if $($(quiet)cmd_$(1)),echo ' $($(quiet)cmd_$(1))';) \ + $(cmd_$(1)); \ + ( \ + echo 'cmd_$@ := $(cmd_$(1))'; \ + echo $@: `SRCTREE=$(srctree) $(DOCPROC) depend $<`; \ + ) > $(dir $@).$(notdir $@).cmd +endef + +%.xml: %.tmpl FORCE + $(call if_changed_rule,docproc) + +### +#Read in all saved dependency files +cmd_files := $(wildcard $(foreach f,$(BOOKS),$(dir $(f)).$(notdir $(f)).cmd)) + +ifneq ($(cmd_files),) + include $(cmd_files) +endif + +### +# Changes in kernel-doc force a rebuild of all documentation +$(BOOKS): $(KERNELDOC) + +### +# procfs guide uses a .c file as example code. +# This requires an explicit dependency +C-procfs-example = procfs_example.xml +C-procfs-example2 = $(addprefix $(obj)/,$(C-procfs-example)) +$(obj)/procfs-guide.xml: $(C-procfs-example2) + +### +# Rules to generate postscript, PDF and HTML +# db2html creates a directory. Generate a html file used for timestamp + +quiet_cmd_db2ps = DB2PS $@ + cmd_db2ps = db2ps -o $(dir $@) $< +%.ps : %.xml + @(which db2ps > /dev/null 2>&1) || \ + (echo "*** You need to install DocBook stylesheets ***"; \ + exit 1) + $(call cmd,db2ps) + +quiet_cmd_db2pdf = DB2PDF $@ + cmd_db2pdf = db2pdf -o $(dir $@) $< +%.pdf : %.xml + @(which db2pdf > /dev/null 2>&1) || \ + (echo "*** You need to install DocBook stylesheets ***"; \ + exit 1) + $(call cmd,db2pdf) + +quiet_cmd_db2html = DB2HTML $@ + cmd_db2html = db2html -o $(patsubst %.html,%,$@) $< && \ + echo ' \ + Goto $(patsubst %.html,%,$(notdir $@))

' > $@ + +%.html: %.xml + @(which db2html > /dev/null 2>&1) || \ + (echo "*** You need to install DocBook stylesheets ***"; \ + exit 1) + @rm -rf $@ $(patsubst %.html,%,$@) + $(call cmd,db2html) + @if [ ! -z "$(PNG-$(basename $(notdir $@)))" ]; then \ + cp $(PNG-$(basename $(notdir $@))) $(patsubst %.html,%,$@); fi + +### +# Rule to generate man files - output is placed in the man subdirectory + +%.9: %.xml +ifneq ($(KBUILD_SRC),) + $(Q)mkdir -p $(objtree)/Documentation/DocBook/man +endif + $(SPLITMAN) $< $(objtree)/Documentation/DocBook/man "$(VERSION).$(PATCHLEVEL).$(SUBLEVEL)" + $(MAKEMAN) convert $(objtree)/Documentation/DocBook/man $< + +### +# Rules to generate postscripts and PNG images from .fig format files +quiet_cmd_fig2eps = FIG2EPS $@ + cmd_fig2eps = fig2dev -Leps $< $@ + +%.eps: %.fig + @(which fig2dev > /dev/null 2>&1) || \ + (echo "*** You need to install transfig ***"; \ + exit 1) + $(call cmd,fig2eps) + +quiet_cmd_fig2png = FIG2PNG $@ + cmd_fig2png = fig2dev -Lpng $< $@ + +%.png: %.fig + @(which fig2dev > /dev/null 2>&1) || \ + (echo "*** You need to install transfig ***"; \ + exit 1) + $(call cmd,fig2png) + +### +# Rule to convert a .c file to inline XML documentation +%.xml: %.c + @echo ' GEN $@' + @( \ + echo "<programlisting>"; \ + expand --tabs=8 < $< | \ + sed -e "s/&/\\&amp;/g" \ + -e "s/</\\&lt;/g" \ + -e "s/>/\\&gt;/g"; \ + echo "</programlisting>") > $@ + +### +# Help targets as used by the top-level makefile +dochelp: + @echo ' Linux kernel internal documentation in different formats:' + @echo ' xmldocs (XML DocBook), psdocs (Postscript), pdfdocs (PDF)' + @echo ' htmldocs (HTML), mandocs (man pages, use installmandocs to install)' + +### +# Temporary files left by various tools +clean-files := $(DOCBOOKS) \ + $(patsubst %.xml, %.dvi, $(DOCBOOKS)) \ + $(patsubst %.xml, %.aux, $(DOCBOOKS)) \ + $(patsubst %.xml, %.tex, $(DOCBOOKS)) \ + $(patsubst %.xml, %.log, $(DOCBOOKS)) \ + $(patsubst %.xml, %.out, $(DOCBOOKS)) \ + $(patsubst %.xml, %.ps, $(DOCBOOKS)) \ + $(patsubst %.xml, %.pdf, $(DOCBOOKS)) \ +
$(patsubst %.xml, %.html, $(DOCBOOKS)) \ + $(patsubst %.xml, %.9, $(DOCBOOKS)) \ + $(C-procfs-example) + +clean-dirs := $(patsubst %.xml,%,$(DOCBOOKS)) + +#man put files in man subdir - traverse down +subdir- := man/ diff --git a/Documentation/DocBook/deviceiobook.tmpl b/Documentation/DocBook/deviceiobook.tmpl new file mode 100644 index 000000000000..6f41f2f5c6f6 --- /dev/null +++ b/Documentation/DocBook/deviceiobook.tmpl @@ -0,0 +1,341 @@ + + + + + + Bus-Independent Device Accesses + + + + Matthew + Wilcox + +

+ matthew@wil.cx +
+ + + + + + + Alan + Cox + +
+ alan@redhat.com +
+
+
+
+ + + 2001 + Matthew Wilcox + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + + + + + + + Introduction + + Linux provides an API which abstracts performing IO across all busses + and devices, allowing device drivers to be written independently of + bus type. + + + + + Known Bugs And Assumptions + + None. + + + + + Memory Mapped IO + + Getting Access to the Device + + The most widely supported form of IO is memory mapped IO. + That is, a part of the CPU's address space is interpreted + not as accesses to memory, but as accesses to a device. Some + architectures define devices to be at a fixed address, but most + have some method of discovering devices. The PCI bus walk is a + good example of such a scheme. This document does not cover how + to receive such an address, but assumes you are starting with one. + Physical addresses are of type unsigned long. + + + + This address should not be used directly. Instead, to get an + address suitable for passing to the accessor functions described + below, you should call ioremap. + An address suitable for accessing the device will be returned to you. 
+ + + + After you've finished using the device (say, in your module's + exit routine), call iounmap in order to return + the address space to the kernel. Most architectures allocate new + address space each time you call ioremap, and + they can run out unless you call iounmap. + + + + + Accessing the device + + The part of the interface most used by drivers is reading and + writing memory-mapped registers on the device. Linux provides + interfaces to read and write 8-bit, 16-bit, 32-bit and 64-bit + quantities. Due to a historical accident, these are named byte, + word, long and quad accesses. Both read and write accesses are + supported; there is no prefetch support at this time. + + + + The functions are named readb, + readw, readl, + readq, readb_relaxed, + readw_relaxed, readl_relaxed, + readq_relaxed, writeb, + writew, writel and + writeq. + + + + Some devices (such as framebuffers) would like to use larger + transfers than 8 bytes at a time. For these devices, the + memcpy_toio, memcpy_fromio + and memset_io functions are provided. + Do not use memset or memcpy on IO addresses; they + are not guaranteed to copy data in order. + + + + The read and write functions are defined to be ordered. That is the + compiler is not permitted to reorder the I/O sequence. When the + ordering can be compiler optimised, you can use + __readb and friends to indicate the relaxed ordering. Use + this with care. + + + + While the basic functions are defined to be synchronous with respect + to each other and ordered with respect to each other the busses the + devices sit on may themselves have asynchronicity. In particular many + authors are burned by the fact that PCI bus writes are posted + asynchronously. A driver author must issue a read from the same + device to ensure that writes have occurred in the specific cases the + author cares. This kind of property cannot be hidden from driver + writers in the API. 
In some cases, the read used to flush the device + may be expected to fail (if the card is resetting, for example). In + that case, the read should be done from config space, which is + guaranteed to soft-fail if the card doesn't respond. + + + + The following is an example of flushing a write to a device when + the driver would like to ensure the write's effects are visible prior + to continuing execution. + + + +static inline void +qla1280_disable_intrs(struct scsi_qla_host *ha) +{ + struct device_reg *reg; + + reg = ha->iobase; + /* disable risc and host interrupts */ + WRT_REG_WORD(&reg->ictrl, 0); + /* + * The following read will ensure that the above write + * has been received by the device before we return from this + * function. + */ + RD_REG_WORD(&reg->ictrl); + ha->flags.ints_enabled = 0; +} + + + + In addition to write posting, on some large multiprocessing systems + (e.g. SGI Challenge, Origin and Altix machines) posted writes won't + be strongly ordered coming from different CPUs. Thus it's important + to properly protect parts of your driver that do memory-mapped writes + with locks and use the mmiowb to make sure they + arrive in the order intended. Issuing a regular readX + will also ensure write ordering, but should only be used + when the driver has to be sure that the write has actually arrived + at the device (not that it's simply ordered with respect to other + writes), since a full readX is a relatively + expensive operation. + + + + Generally, one should use mmiowb prior to + releasing a spinlock that protects regions using writeb + or similar functions that aren't surrounded by + readb calls, which will ensure ordering and flushing. The + following pseudocode illustrates what might occur if write ordering + isn't guaranteed via mmiowb or one of the + readX functions. + + + +CPU A: spin_lock_irqsave(&dev_lock, flags) +CPU A: ... +CPU A: writel(newval, ring_ptr); +CPU A: spin_unlock_irqrestore(&dev_lock, flags) + ... 
+CPU B: spin_lock_irqsave(&dev_lock, flags) +CPU B: writel(newval2, ring_ptr); +CPU B: ... +CPU B: spin_unlock_irqrestore(&dev_lock, flags) + + + + In the case above, newval2 could be written to ring_ptr before + newval. Fixing it is easy though: + + + +CPU A: spin_lock_irqsave(&dev_lock, flags) +CPU A: ... +CPU A: writel(newval, ring_ptr); +CPU A: mmiowb(); /* ensure no other writes beat us to the device */ +CPU A: spin_unlock_irqrestore(&dev_lock, flags) + ... +CPU B: spin_lock_irqsave(&dev_lock, flags) +CPU B: writel(newval2, ring_ptr); +CPU B: ... +CPU B: mmiowb(); +CPU B: spin_unlock_irqrestore(&dev_lock, flags) + + + + See tg3.c for a real world example of how to use mmiowb + + + + + PCI ordering rules also guarantee that PIO read responses arrive + after any outstanding DMA writes from that bus, since for some devices + the result of a readb call may signal to the + driver that a DMA transaction is complete. In many cases, however, + the driver may want to indicate that the next + readb call has no relation to any previous DMA + writes performed by the device. The driver can use + readb_relaxed for these cases, although only + some platforms will honor the relaxed semantics. Using the relaxed + read functions will provide significant performance benefits on + platforms that support it. The qla2xxx driver provides examples + of how to use readX_relaxed. In many cases, + a majority of the driver's readX calls can + safely be converted to readX_relaxed calls, since + only a few will indicate or depend on DMA completion. + + + + + ISA legacy functions + + On older kernels (2.2 and earlier) the ISA bus could be read or + written with these functions and without ioremap being used. This is + no longer true in Linux 2.4. A set of equivalent functions exist for + easy legacy driver porting. 
The functions available are prefixed + with 'isa_' and are isa_readb, + isa_writeb, isa_readw, + isa_writew, isa_readl, + isa_writel, isa_memcpy_fromio + and isa_memcpy_toio + + + These functions should not be used in new drivers, and will + eventually be going away. + + + + + + + Port Space Accesses + + Port Space Explained + + + Another form of IO commonly supported is Port Space. This is a + range of addresses separate to the normal memory address space. + Access to these addresses is generally not as fast as accesses + to the memory mapped addresses, and it also has a potentially + smaller address space. + + + + Unlike memory mapped IO, no preparation is required + to access port space. + + + + + Accessing Port Space + + Accesses to this space are provided through a set of functions + which allow 8-bit, 16-bit and 32-bit accesses; also + known as byte, word and long. These functions are + inb, inw, + inl, outb, + outw and outl. + + + + Some variants are provided for these functions. Some devices + require that accesses to their ports are slowed down. This + functionality is provided by appending a _p + to the end of the function. There are also equivalents to memcpy. + The ins and outs + functions copy bytes, words or longs to the given port. + + + + + + + Public Functions Provided +!Einclude/asm-i386/io.h + + + diff --git a/Documentation/DocBook/gadget.tmpl b/Documentation/DocBook/gadget.tmpl new file mode 100644 index 000000000000..a34442436128 --- /dev/null +++ b/Documentation/DocBook/gadget.tmpl @@ -0,0 +1,752 @@ + + + + + + USB Gadget API for Linux + 20 August 2004 + 20 August 2004 + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. 
+ + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + + + 2003-2004 + David Brownell + + + + David + Brownell + +
dbrownell@users.sourceforge.net
+
+
+
+ + + +Introduction + +This document presents a Linux-USB "Gadget" +kernel mode +API, for use within peripherals and other USB devices +that embed Linux. +It provides an overview of the API structure, +and shows how that fits into a system development project. +This is the first such API released on Linux to address +a number of important problems, including: + + + Supports USB 2.0, for high speed devices which + can stream data at several dozen megabytes per second. + + Handles devices with dozens of endpoints just as + well as ones with just two fixed-function ones. Gadget drivers + can be written so they're easy to port to new hardware. + + Flexible enough to expose more complex USB device + capabilities such as multiple configurations, multiple interfaces, + composite devices, + and alternate interface settings. + + USB "On-The-Go" (OTG) support, in conjunction + with updates to the Linux-USB host side. + + Sharing data structures and API models with the + Linux-USB host side API. This helps the OTG support, and + looks forward to more-symmetric frameworks (where the same + I/O model is used by both host and device side drivers). + + Minimalist, so it's easier to support new device + controller hardware. I/O processing doesn't imply large + demands for memory or CPU resources. + + + + +Most Linux developers will not be able to use this API, since they +have USB "host" hardware in a PC, workstation, or server. +Linux users with embedded systems are more likely to +have USB peripheral hardware. +To distinguish drivers running inside such hardware from the +more familiar Linux "USB device drivers", +which are host side proxies for the real USB devices, +a different term is used: +the drivers inside the peripherals are "USB gadget drivers". +In USB protocol interactions, the device driver is the master +(or "client driver") +and the gadget driver is the slave (or "function driver"). 
+ + +The gadget API resembles the host side Linux-USB API in that both +use queues of request objects to package I/O buffers, and those requests +may be submitted or canceled. +They share common definitions for the standard USB +Chapter 9 messages, structures, and constants. +Also, both APIs bind and unbind drivers to devices. +The APIs differ in detail, since the host side's current +URB framework exposes a number of implementation details +and assumptions that are inappropriate for a gadget API. +While the model for control transfers and configuration +management is necessarily different (one side is a hardware-neutral master, +the other is a hardware-aware slave), the endpoint I/O API used here +should also be usable for an overhead-reduced host side API. + + + + +Structure of Gadget Drivers + +A system running inside a USB peripheral +normally has at least three layers inside the kernel to handle +USB protocol processing, and may have additional layers in +user space code. +The "gadget" API is used by the middle layer to interact +with the lowest level (which directly handles hardware). + + +In Linux, from the bottom up, these layers are: + + + + + + USB Controller Driver + + + This is the lowest software level. + It is the only layer that talks to hardware, + through registers, fifos, dma, irqs, and the like. + The <linux/usb_gadget.h> API abstracts + the peripheral controller endpoint hardware. + That hardware is exposed through endpoint objects, which accept + streams of IN/OUT buffers, and through callbacks that interact + with gadget drivers. + Since normal USB devices only have one upstream + port, they only have one of these drivers. + The controller driver can support any number of different + gadget drivers, but only one of them can be used at a time. + + + Examples of such controller hardware include + the PCI-based NetChip 2280 USB 2.0 high speed controller, + the SA-11x0 or PXA-25x UDC (found within many PDAs), + and a variety of other products.
+ + + + + + Gadget Driver + + + The lower boundary of this driver implements hardware-neutral + USB functions, using calls to the controller driver. + Because such hardware varies widely in capabilities and restrictions, + and is used in embedded environments where space is at a premium, + the gadget driver is often configured at compile time + to work with endpoints supported by one particular controller. + Gadget drivers may be portable to several different controllers, + using conditional compilation. + (Recent kernels substantially simplify the work involved in + supporting new hardware, by autoconfiguring + endpoints automatically for many bulk-oriented drivers.) + Gadget driver responsibilities include: + + + handling setup requests (ep0 protocol responses) + possibly including class-specific functionality + + returning configuration and string descriptors + + (re)setting configurations and interface + altsettings, including enabling and configuring endpoints + + handling life cycle events, such as managing + bindings to hardware, + USB suspend/resume, remote wakeup, + and disconnection from the USB host. + + managing IN and OUT transfers on all currently + enabled endpoints + + + + + Such drivers may be modules of proprietary code, although + that approach is discouraged in the Linux community. + + + + + Upper Level + + + Most gadget drivers have an upper boundary that connects + to some Linux driver or framework in Linux. + Through that boundary flows the data which the gadget driver + produces and/or consumes through protocol transfers over USB. + Examples include: + + + user mode code, using generic (gadgetfs) + or application specific files in + /dev + + networking subsystem (for network gadgets, + like the CDC Ethernet Model gadget driver) + + data capture drivers, perhaps video4Linux or + a scanner driver; or test and measurement hardware. 
+ + input subsystem (for HID gadgets) + + sound subsystem (for audio gadgets) + + file system (for PTP gadgets) + + block i/o subsystem (for usb-storage gadgets) + + ... and more + + + + + Additional Layers + + + Other layers may exist. + These could include kernel layers, such as network protocol stacks, + as well as user mode applications building on standard POSIX + system call APIs such as + open(), close(), + read() and write(). + On newer systems, POSIX Async I/O calls may be an option. + Such user mode code will not necessarily be subject to + the GNU General Public License (GPL). + + + + + + +OTG-capable systems will also need to include a standard Linux-USB +host side stack, +with usbcore, +one or more Host Controller Drivers (HCDs), +USB Device Drivers to support +the OTG "Targeted Peripheral List", +and so forth. +There will also be an OTG Controller Driver, +which is visible to gadget and device driver developers only indirectly. +That helps the host and device side USB controllers implement the +two new OTG protocols (HNP and SRP). +Roles switch (host to peripheral, or vice versa) using HNP +during USB suspend processing, and SRP can be viewed as a +more battery-friendly kind of device wakeup protocol. + + +Over time, reusable utilities are evolving to help make some +gadget driver tasks simpler. +For example, building configuration descriptors from vectors of +descriptors for the configurations interfaces and endpoints is +now automated, and many drivers now use autoconfiguration to +choose hardware endpoints and initialize their descriptors. + +A potential example of particular interest +is code implementing standard USB-IF protocols for +HID, networking, storage, or audio classes. +Some developers are interested in KDB or KGDB hooks, to let +target hardware be remotely debugged. +Most such USB protocol code doesn't need to be hardware-specific, +any more than network protocols like X11, HTTP, or NFS are. 
+Such gadget-side interface drivers should eventually be combined, +to implement composite devices. + + + + + +Kernel Mode Gadget API + +Gadget drivers declare themselves through a +struct usb_gadget_driver, which is responsible for +most parts of enumeration for a struct usb_gadget. +The response to a set_configuration usually involves +enabling one or more of the struct usb_ep objects +exposed by the gadget, and submitting one or more +struct usb_request buffers to transfer data. +Understand those four data types, and their operations, and +you will understand how this API works. + + +Incomplete Data Type Descriptions + +This documentation was prepared using the standard Linux +kernel docproc tool, which turns text +and in-code comments into SGML DocBook and then into usable +formats such as HTML or PDF. +Other than the "Chapter 9" data types, most of the significant +data types and functions are described here. + + +However, docproc does not understand all the C constructs +that are used, so some relevant information is likely omitted from +what you are reading. +One example of such information is endpoint autoconfiguration. +You'll have to read the header file, and use example source +code (such as that for "Gadget Zero"), to fully understand the API. + + +The part of the API implementing some basic +driver capabilities is specific to the version of the +Linux kernel that's in use. +The 2.6 kernel includes a driver model +framework that has no analogue on earlier kernels; +so those parts of the gadget API are not fully portable. +(They are implemented on 2.4 kernels, but in a different way.) +The driver model state is another part of this API that is +ignored by the kerneldoc tools. + + + +The core API does not expose +every possible hardware feature, only the most widely available ones. +There are significant hardware features, such as device-to-device DMA +(without temporary storage in a memory buffer) +that would be added using hardware-specific APIs. 
+ + +This API allows drivers to use conditional compilation to handle +endpoint capabilities of different hardware, but doesn't require that. +Hardware tends to have arbitrary restrictions, relating to +transfer types, addressing, packet sizes, buffering, and availability. +As a rule, such differences only matter for "endpoint zero" logic +that handles device configuration and management. +The API supports limited run-time +detection of capabilities, through naming conventions for endpoints. +Many drivers will be able to at least partially autoconfigure +themselves. +In particular, driver init sections will often have endpoint +autoconfiguration logic that scans the hardware's list of endpoints +to find ones matching the driver requirements +(relying on those conventions), to eliminate some of the most +common reasons for conditional compilation. + + +Like the Linux-USB host side API, this API exposes +the "chunky" nature of USB messages: I/O requests are in terms +of one or more "packets", and packet boundaries are visible to drivers. +Compared to RS-232 serial protocols, USB resembles +synchronous protocols like HDLC +(N bytes per frame, multipoint addressing, host as the primary +station and devices as secondary stations) +more than asynchronous ones +(tty style: 8 data bits per frame, no parity, one stop bit). +So for example the controller drivers won't buffer +two single byte writes into a single two-byte USB IN packet, +although gadget drivers may do so when they implement +protocols where packet boundaries (and "short packets") +are not significant. + + +Driver Life Cycle + +Gadget drivers make endpoint I/O requests to hardware without +needing to know many details of the hardware, but driver +setup/configuration code needs to handle some differences. +Use the API like this: + + + + +Register a driver for the particular device side +usb controller hardware, +such as the net2280 on PCI (USB 2.0), +sa11x0 or pxa25x as found in Linux PDAs, +and so on. 
+At this point the device is logically in the USB ch9 initial state +("attached"), drawing no power and not usable +(since it does not yet support enumeration). +No host should see the device, since it has not +activated the data line pullup used by the host to +detect a device, even if VBUS power is available. + + +Register a gadget driver that implements some higher level +device function. That will then bind() to a usb_gadget, which +activates the data line pullup sometime after detecting VBUS. + + +The hardware driver can now start enumerating. +The steps it handles are to accept USB power and set_address requests. +Other steps are handled by the gadget driver. +If the gadget driver module is unloaded before the host starts to +enumerate, steps before step 7 are skipped. + + +The gadget driver's setup() call returns usb descriptors, +based both on what the bus interface hardware provides and on the +functionality being implemented. +That can involve alternate settings or configurations, +unless the hardware prevents such operation. +For OTG devices, each configuration descriptor includes +an OTG descriptor. + + +The gadget driver handles the last step of enumeration, +when the USB host issues a set_configuration call. +It enables all endpoints used in that configuration, +with all interfaces in their default settings. +That involves using a list of the hardware's endpoints, enabling each +endpoint according to its descriptor. +It may also involve using usb_gadget_vbus_draw +to let more power be drawn from VBUS, as allowed by that configuration. +For OTG devices, setting a configuration may also involve reporting +HNP capabilities through a user interface. + + +Do real work and perform data transfers, possibly involving +changes to interface settings or switching to new configurations, until the +device is disconnect()ed from the host. +Queue any number of transfer requests to each endpoint.
+It may be suspended and resumed several times before being disconnected. +On disconnect, the drivers go back to step 3 (above). + + +When the gadget driver module is being unloaded, +the driver unbind() callback is issued. That lets the controller +driver be unloaded. + + + + +Drivers will normally be arranged so that just loading the +gadget driver module (or statically linking it into a Linux kernel) +allows the peripheral device to be enumerated, but some drivers +will defer enumeration until some higher level component (like +a user mode daemon) enables it. +Note that at this lowest level there are no policies about how +ep0 configuration logic is implemented, +except that it should obey USB specifications. +Such issues are in the domain of gadget drivers, +including knowing about implementation constraints +imposed by some USB controllers +or understanding that composite devices might happen to +be built by integrating reusable components. + + +Note that the lifecycle above can be slightly different +for OTG devices. +Other than providing an additional OTG descriptor in each +configuration, only the HNP-related differences are particularly +visible to driver code. +They involve reporting requirements during the SET_CONFIGURATION +request, and the option to invoke HNP during some suspend callbacks. +Also, SRP changes the semantics of +usb_gadget_wakeup +slightly. + + + + +USB 2.0 Chapter 9 Types and Constants + +Gadget drivers +rely on common USB structures and constants +defined in the +<linux/usb_ch9.h> +header file, which is standard in Linux 2.6 kernels. +These are the same types and constants used by host +side drivers (and usbcore). + + +!Iinclude/linux/usb_ch9.h + + +Core Objects and Methods + +These are declared in +<linux/usb_gadget.h>, +and are used by gadget drivers to interact with +USB peripheral controller drivers. 
+ + + + +!Iinclude/linux/usb_gadget.h + + +Optional Utilities + +The core API is sufficient for writing a USB Gadget Driver, +but some optional utilities are provided to simplify common tasks. +These utilities include endpoint autoconfiguration. + + +!Edrivers/usb/gadget/usbstring.c +!Edrivers/usb/gadget/config.c + + + + + +Peripheral Controller Drivers + +The first hardware supporting this API was the NetChip 2280 +controller, which supports USB 2.0 high speed and is based on PCI. +This is the net2280 driver module. +The driver supports Linux kernel versions 2.4 and 2.6; +contact NetChip Technologies for development boards and product +information. + + +Other hardware working in the "gadget" framework includes: +Intel's PXA 25x and IXP42x series processors +(pxa2xx_udc), +Toshiba TC86c001 "Goku-S" (goku_udc), +Renesas SH7705/7727 (sh_udc), +MediaQ 11xx (mq11xx_udc), +Hynix HMS30C7202 (h7202_udc), +National 9303/4 (n9604_udc), +Texas Instruments OMAP (omap_udc), +Sharp LH7A40x (lh7a40x_udc), +and more. +Most of those are full speed controllers. + + +At this writing, there are people at work on drivers in +this framework for several other USB device controllers, +with plans to make many of them be widely available. + + + + +A partial USB simulator, +the dummy_hcd driver, is available. +It can act like a net2280, a pxa25x, or an sa11x0 in terms +of available endpoints and device speeds; and it simulates +control, bulk, and to some extent interrupt transfers. +That lets you develop some parts of a gadget driver on a normal PC, +without any special hardware, and perhaps with the assistance +of tools such as GDB running with User Mode Linux. +At least one person has expressed interest in adapting that +approach, hooking it up to a simulator for a microcontroller. +Such simulators can help debug subsystems where the runtime hardware +is unfriendly to software development, or is not yet available. 
+ + +Support for other controllers is expected to be developed +and contributed +over time, as this driver framework evolves. + + + + +Gadget Drivers + +In addition to Gadget Zero +(used primarily for testing and development with drivers +for usb controller hardware), other gadget drivers exist. + + +There's an ethernet gadget +driver, which implements one of the most useful +Communications Device Class (CDC) models. +One of the standards for cable modem interoperability even +specifies the use of this ethernet model as one of two +mandatory options. +Gadgets using this code look to a USB host as if they're +an Ethernet adapter. +It provides access to a network where the gadget's CPU is one host, +which could easily be bridging, routing, or firewalling +access to other networks. +Since some hardware can't fully implement the CDC Ethernet +requirements, this driver also implements a "good parts only" +subset of CDC Ethernet. +(That subset doesn't advertise itself as CDC Ethernet, +to avoid creating problems.) + + +Support for Microsoft's RNDIS +protocol has been contributed by Pengutronix and Auerswald GmbH. +This is like CDC Ethernet, but it runs on slightly more USB hardware +(but less than the CDC subset). +However, its main claim to fame is being able to connect directly to +recent versions of Windows, using drivers that Microsoft bundles +and supports, making it much simpler to network with Windows. + + +There is also support for user mode gadget drivers, +using gadgetfs. +This provides a User Mode API that presents +each endpoint as a single file descriptor. I/O is done using +normal read() and write() calls. +Familiar tools like GDB and pthreads can be used to +develop and debug user mode drivers, so that once a robust +controller driver is available many applications for it +won't require new kernel mode software.
+Linux 2.6 Async I/O (AIO) +support is available, so that user mode software +can stream data with only slightly more overhead +than a kernel driver. + + +There's a USB Mass Storage class driver, which provides +a different solution for interoperability with systems such +as MS-Windows and MacOS. +That File-backed Storage driver uses a +file or block device as backing store for a drive, +like the loop driver. +The USB host uses the BBB, CB, or CBI versions of the mass +storage class specification, using transparent SCSI commands +to access the data from the backing store. + + +There's a "serial line" driver, useful for TTY style +operation over USB. +The latest version of that driver supports CDC ACM style +operation, like a USB modem, and so on most hardware it can +interoperate easily with MS-Windows. +One interesting use of that driver is in boot firmware (like a BIOS), +which can sometimes use that model with very small systems without +real serial lines. + + +Support for other kinds of gadget is expected to +be developed and contributed +over time, as this driver framework evolves. + + + + +USB On-The-Go (OTG) + +USB OTG support on Linux 2.6 was initially developed +by Texas Instruments for +OMAP 16xx and 17xx +series processors. +Other OTG systems should work in similar ways, but the +hardware level details could be very different. + + +Systems need specialized hardware support to implement OTG, +notably including a special Mini-AB jack +and associated transceiver to support Dual-Role +operation: +they can act either as a host, using the standard +Linux-USB host side driver stack, +or as a peripheral, using this "gadget" framework. +To do that, the system software relies on small additions +to those programming interfaces, +and on a new internal component (here called an "OTG Controller") +affecting which driver stack connects to the OTG port.
+In each role, the system can re-use the existing pool of +hardware-neutral drivers, layered on top of the controller +driver interfaces (usb_bus or +usb_gadget). +Such drivers need at most minor changes, and most of the calls +added to support OTG can also benefit non-OTG products. + + + + Gadget drivers test the is_otg + flag, and use it to determine whether or not to include + an OTG descriptor in each of their configurations. + + Gadget drivers may need changes to support the + two new OTG protocols, exposed in new gadget attributes + such as b_hnp_enable flag. + HNP support should be reported through a user interface + (two LEDs could suffice), and is triggered in some cases + when the host suspends the peripheral. + SRP support can be user-initiated just like remote wakeup, + probably by pressing the same button. + + On the host side, USB device drivers need + to be taught to trigger HNP at appropriate moments, using + usb_suspend_device(). + That also conserves battery power, which is useful even + for non-OTG configurations. + + Also on the host side, a driver must support the + OTG "Targeted Peripheral List". That's just a whitelist, + used to reject peripherals not supported with a given + Linux OTG host. + This whitelist is product-specific; + each product must modify otg_whitelist.h + to match its interoperability specification. + + + Non-OTG Linux hosts, like PCs and workstations, + normally have some solution for adding drivers, so that + peripherals that aren't recognized can eventually be supported. + That approach is unreasonable for consumer products that may + never have their firmware upgraded, and where it's usually + unrealistic to expect traditional PC/workstation/server kinds + of support model to work. + For example, it's often impractical to change device firmware + once the product has been distributed, so driver bugs can't + normally be fixed if they're found after shipment. 
+ + + + +Additional changes are needed below those hardware-neutral +usb_bus and usb_gadget +driver interfaces; those aren't discussed here in any detail. +Those affect the hardware-specific code for each USB Host or Peripheral +controller, and how the HCD initializes (since OTG can be active only +on a single port). +They also involve what may be called an OTG Controller +Driver, managing the OTG transceiver and the OTG state +machine logic as well as much of the root hub behavior for the +OTG port. +The OTG controller driver needs to activate and deactivate USB +controllers depending on the relevant device role. +Some related changes were needed inside usbcore, so that it +can identify OTG-capable devices and respond appropriately +to HNP or SRP protocols. + + + + +
+ diff --git a/Documentation/DocBook/journal-api.tmpl b/Documentation/DocBook/journal-api.tmpl new file mode 100644 index 000000000000..1ef6f43c6d8f --- /dev/null +++ b/Documentation/DocBook/journal-api.tmpl @@ -0,0 +1,333 @@ + + + + + + The Linux Journalling API + + + Roger + Gammans + +
+ rgammans@computer-surgery.co.uk +
+
+
+
+ + + + Stephen + Tweedie + +
+ sct@redhat.com +
+
+
+
+ + + 2002 + Roger Gammans + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + +
+ + + + + Overview + + Details + +The journalling layer is easy to use. You need to +first of all create a journal_t data structure. There are +two calls to do this, depending on how you decide to allocate the physical +media on which the journal resides. The journal_init_inode() call +is for journals stored in filesystem inodes, or the journal_init_dev() +call can be used for journals stored on a raw device (in a contiguous range +of blocks). A journal_t is a typedef for a struct which you hold through a pointer, so when +you are finally finished make sure you call journal_destroy() on it +to free up any used kernel memory. + + + +Once you have got your journal_t object you need to 'mount' or load the journal +file, unless of course you haven't initialised it yet - in which case you +need to call journal_create(). + + + +Most of the time however your journal file will already have been created, but +before you load it you must call journal_wipe() to empty the journal file. +Hang on, you say, what if the filesystem wasn't cleanly umount()'d? Well, it is the +job of the client file system to detect this and skip the call to journal_wipe(). + + + +In either case the next call should be to journal_load() which prepares the +journal file for use. Note that journal_wipe(..,0) calls journal_skip_recovery() +for you if it detects any outstanding transactions in the journal and similarly +journal_load() will call journal_recover() if necessary. +I would advise reading fs/ext3/super.c for examples on this stage. +[RGG: Why is the journal_wipe() call necessary - doesn't this needlessly +complicate the API? Or isn't it a good idea for the journal layer to hide +dirty mounts from the client fs?] + + + +Now you can go ahead and start modifying the underlying +filesystem. Almost. + + + + + +You still need to actually journal your filesystem changes; this +is done by wrapping them into transactions.
Additionally you +also need to wrap the modification of each of the buffers +with calls to the journal layer, so it knows what modifications +you are actually making. To do this use journal_start() which +returns a transaction handle. + + + +journal_start() +and its counterpart journal_stop(), which indicates the end of a transaction, +are nestable calls, so you can reenter a transaction if necessary, +but remember you must call journal_stop() the same number of times as +journal_start() before the transaction is completed (or more accurately +leaves the update phase). Ext3/VFS makes use of this feature to simplify +quota support. + + + +Inside each transaction you need to wrap the modifications to the +individual buffers (blocks). Before you start to modify a buffer you +need to call journal_get_{create,write,undo}_access() as appropriate; +this allows the journalling layer to copy the unmodified data if it +needs to. After all, the buffer may be part of a previously uncommitted +transaction. +At this point you are at last ready to modify a buffer, and once +you have done so you need to call journal_dirty_{meta,}data(). +Or if you've asked for access to a buffer you now know is no longer +required to be pushed back on the device you can call journal_forget() +in much the same way as you might have used bforget() in the past. + + + +A journal_flush() may be called at any time to commit and checkpoint +all your transactions. + + + +Then at umount time, in your put_super() (2.4) or write_super() (2.5) +you can then call journal_destroy() to clean up your in-core journal object. + + + + +Unfortunately there are a couple of ways the journal layer can cause a deadlock. +The first thing to note is that each task can only have +a single outstanding transaction at any one time; remember, nothing +commits until the outermost journal_stop(). This means +you must complete the transaction at the end of each file/inode/address +etc.
operation you perform, so that the journalling system isn't re-entered +on another journal: transactions can't be nested/batched +across differing journals, and a filesystem other than +yours (say ext3) may be modified in a later syscall. + + + +The second case to bear in mind is that journal_start() can +block if there isn't enough space in the journal for your transaction +(based on the passed nblocks param) - when it blocks it merely(!) needs to +wait for transactions to complete and be committed from other tasks, +so essentially we are waiting for journal_stop(). So to avoid +deadlocks you must treat journal_start/stop() as if they +were semaphores and include them in your semaphore ordering rules to prevent +deadlocks. Note that journal_extend() has similar blocking behaviour to +journal_start() so you can deadlock here just as easily as on journal_start(). + + + +Try to reserve the right number of blocks the first time. ;-). This will +be the maximum number of blocks you are going to touch in this transaction. +I advise having a look at at least ext3_jbd.h to see the basis on which +ext3 makes these decisions. + + + +Another wriggle to watch out for is your on-disk block allocation strategy. +Why? Because if you undo a delete, you need to ensure you haven't reused any +of the freed blocks in a later transaction. One simple way of doing this +is to make sure any blocks you allocate only have checkpointed transactions +listed against them. Ext3 does this in ext3_test_allocatable(). + + + +Locking is also provided through journal_{un,}lock_updates(); +ext3 uses this when it wants a window with a clean and stable fs for a moment. +e.g. + + + + + journal_lock_updates() //stop new stuff happening.. + journal_flush() // checkpoint everything. + ..do stuff on stable fs + journal_unlock_updates() // carry on with filesystem use.
+ + + +The opportunities for abuse and DOS attacks with this should be obvious, +if you allow unprivileged userspace to trigger codepaths containing these +calls. + + + +A new feature of jbd since 2.5.25 is commit callbacks: with the new +journal_callback_set() function you can now ask the journalling layer +to call you back when the transaction is finally committed to disk, so that +you can do some of your own management. The key to this is the journal_callback +struct; this maintains the internal callback information but you can +extend it like this: + + + struct myfs_callback_s { + //Data structure element required by jbd.. + struct journal_callback for_jbd; + // Stuff for myfs allocated together. + myfs_inode* i_committed; + + }; + + + +This would be useful if you needed to know when data was committed to a +particular inode. + + + + + +Summary + +Using the journal is a matter of wrapping the different context changes, +being each mount, each modification (transaction) and each changed buffer +to tell the journalling layer about them. + + + +Here is some pseudo code to give you an idea of how it works, as +an example. + + + + journal_t* my_jrnl = journal_init_{dev,inode}(...); + journal_create(my_jrnl); /* only for a new, uninitialised journal */ + if (clean) journal_wipe(); + journal_load(); + + foreach(transaction) { /*transactions must be + completed before + a syscall returns to + userspace*/ + + handle_t * xact=journal_start(my_jrnl); + foreach(bh) { + journal_get_{create,write,undo}_access(xact,bh); + if ( myfs_modify(bh) ) { /* returns true + if makes changes */ + journal_dirty_{meta,}data(xact,bh); + } else { + journal_forget(bh); + } + } + journal_stop(xact); + } + journal_destroy(my_jrnl); + + + + + + + Data Types + + The journalling layer uses typedefs to 'hide' the concrete definitions + of the structures used. As a client of the JBD layer you can + just rely on using the pointer as a magic cookie of some sort. + + Obviously the hiding is not enforced as this is 'C'.
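The 'magic cookie' idea can be sketched in plain user-space C. This is only an illustration of the opaque-typedef pattern, not the jbd definitions: the myjournal_t type, its fields and the myjournal_* functions are all hypothetical names invented for this example.

```c
/* Hypothetical names throughout; this is NOT the jbd API. */
#include <stdlib.h>

/* What a client sees: an incomplete type hidden behind a typedef. */
typedef struct myjournal_s myjournal_t;

myjournal_t *myjournal_open(unsigned int nblocks);
unsigned int myjournal_free_blocks(const myjournal_t *j);
int myjournal_reserve(myjournal_t *j, unsigned int nblocks);
void myjournal_close(myjournal_t *j);

/* What only the implementation is meant to use: the concrete layout. */
struct myjournal_s {
	unsigned int total;	/* blocks in the journal */
	unsigned int used;	/* blocks currently reserved */
};

/* Allocate a cookie; returns NULL if memory is not available. */
myjournal_t *myjournal_open(unsigned int nblocks)
{
	myjournal_t *j = malloc(sizeof(*j));

	if (!j)
		return NULL;
	j->total = nblocks;
	j->used = 0;
	return j;
}

unsigned int myjournal_free_blocks(const myjournal_t *j)
{
	return j->total - j->used;
}

/* Returns 0 on success, -1 if the reservation does not fit. */
int myjournal_reserve(myjournal_t *j, unsigned int nblocks)
{
	if (nblocks > j->total - j->used)
		return -1;
	j->used += nblocks;
	return 0;
}

void myjournal_close(myjournal_t *j)
{
	free(j);
}
```

A client written against the four prototypes never needs the field names, so the layout can change without touching callers. Nothing stops code that can see the struct definition from reaching into it, which is the sense in which the hiding is not enforced.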
+ + Structures +!Iinclude/linux/jbd.h + + + + + Functions + + The functions here are split into two groups: those that + affect a journal as a whole, and those which are used to + manage transactions. + + Journal Level +!Efs/jbd/journal.c +!Efs/jbd/recovery.c + + Transaction Level +!Efs/jbd/transaction.c + + + + See also + + + + Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen Tweedie + + + + + + + Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen Tweedie + + + + + +
diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl new file mode 100644 index 000000000000..1bd20c860285 --- /dev/null +++ b/Documentation/DocBook/kernel-api.tmpl @@ -0,0 +1,342 @@ + + + + + + The Linux Kernel API + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + + + + + + + Driver Basics + Driver Entry and Exit points +!Iinclude/linux/init.h + + + Atomic and pointer manipulation +!Iinclude/asm-i386/atomic.h +!Iinclude/asm-i386/unaligned.h + + + + + + + Data Types + Doubly Linked Lists +!Iinclude/linux/list.h + + + + + Basic C Library Functions + + + When writing drivers, you cannot in general use routines which are + from the C Library. Some of the functions have been found generally + useful and they are listed below. The behaviour of these functions + may vary slightly from those defined by ANSI, and these deviations + are noted in the text. 
+ + + String Conversions +!Ilib/vsprintf.c +!Elib/vsprintf.c + + String Manipulation +!Ilib/string.c +!Elib/string.c + + Bit Operations +!Iinclude/asm-i386/bitops.h + + + + + Memory Management in Linux + The Slab Cache +!Emm/slab.c + + User Space Memory Access +!Iinclude/asm-i386/uaccess.h +!Iarch/i386/lib/usercopy.c + + + + + FIFO Buffer + kfifo interface +!Iinclude/linux/kfifo.h +!Ekernel/kfifo.c + + + + + The proc filesystem + + sysctl interface +!Ekernel/sysctl.c + + + + + The debugfs filesystem + + debugfs interface +!Efs/debugfs/inode.c +!Efs/debugfs/file.c + + + + + The Linux VFS + The Directory Cache +!Efs/dcache.c +!Iinclude/linux/dcache.h + + Inode Handling +!Efs/inode.c +!Efs/bad_inode.c + + Registration and Superblocks +!Efs/super.c + + File Locks +!Efs/locks.c +!Ifs/locks.c + + + + + Linux Networking + Socket Buffer Functions +!Iinclude/linux/skbuff.h +!Enet/core/skbuff.c + + Socket Filter +!Enet/core/filter.c + + Generic Network Statistics +!Iinclude/linux/gen_stats.h +!Enet/core/gen_stats.c +!Enet/core/gen_estimator.c + + + + + Network device support + Driver Support +!Enet/core/dev.c + + 8390 Based Network Cards +!Edrivers/net/8390.c + + Synchronous PPP +!Edrivers/net/wan/syncppp.c + + + + + Module Support + Module Loading +!Ekernel/kmod.c + + Inter Module support + + Refer to the file kernel/module.c for more information. + + + + + + + Hardware Interfaces + Interrupt Handling +!Iarch/i386/kernel/irq.c + + + MTRR Handling +!Earch/i386/kernel/cpu/mtrr/main.c + + PCI Support Library +!Edrivers/pci/pci.c + + PCI Hotplug Support Library +!Edrivers/pci/hotplug/pci_hotplug_core.c + + MCA Architecture + MCA Device Functions + + Refer to the file arch/i386/kernel/mca.c for more information. 
+ + + + MCA Bus DMA +!Iinclude/asm-i386/mca_dma.h + + + + + + The Device File System +!Efs/devfs/base.c + + + + Security Framework +!Esecurity/security.c + + + + Power Management +!Ekernel/power/pm.c + + + + Block Devices +!Edrivers/block/ll_rw_blk.c + + + + Miscellaneous Devices +!Edrivers/char/misc.c + + + + Video4Linux +!Edrivers/media/video/videodev.c + + + + Sound Devices +!Esound/sound_core.c + + + + + 16x50 UART Driver +!Edrivers/serial/serial_core.c +!Edrivers/serial/8250.c + + + + Z85230 Support Library +!Edrivers/net/wan/z85230.c + + + + Frame Buffer Library + + + The frame buffer drivers depend heavily on four data structures. + These structures are declared in include/linux/fb.h. They are + fb_info, fb_var_screeninfo, fb_fix_screeninfo and fb_monospecs. + The last three can be made available to and from userland. + + + + fb_info defines the current state of a particular video card. + Inside fb_info, there exists a fb_ops structure which is a + collection of needed functions to make fbdev and fbcon work. + fb_info is only visible to the kernel. + + + + fb_var_screeninfo is used to describe the features of a video card + that are user defined. With fb_var_screeninfo, things such as + depth and the resolution may be defined. + + + + The next structure is fb_fix_screeninfo. This defines the + properties of a card that are created when a mode is set and can't + be changed otherwise. A good example of this is the start of the + frame buffer memory. This "locks" the address of the frame buffer + memory, so that it cannot be changed or moved. + + + + The last structure is fb_monospecs. In the old API, there was + little importance for fb_monospecs. This allowed for forbidden things + such as setting a mode of 800x600 on a fix frequency monitor. With + the new API, fb_monospecs prevents such things, and if used + correctly, can prevent a monitor from being cooked. fb_monospecs + will not be useful until kernels 2.5.x. 
+ + + Frame Buffer Memory +!Edrivers/video/fbmem.c + + Frame Buffer Console +!Edrivers/video/console/fbcon.c + + Frame Buffer Colormap +!Edrivers/video/fbcmap.c + + + Frame Buffer Video Mode Database +!Idrivers/video/modedb.c +!Edrivers/video/modedb.c + + Frame Buffer Macintosh Video Mode Database +!Idrivers/video/macmodes.c + + Frame Buffer Fonts + + Refer to the file drivers/video/console/fonts.c for more information. + + + + + diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl new file mode 100644 index 000000000000..49a9ef82d575 --- /dev/null +++ b/Documentation/DocBook/kernel-hacking.tmpl @@ -0,0 +1,1349 @@ + + + + + + Unreliable Guide To Hacking The Linux Kernel + + + + Paul + Rusty + Russell + +
+ rusty@rustcorp.com.au +
+
+
+
+ + + 2001 + Rusty Russell + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + + + + This is the first release of this document as part of the kernel tarball. + + +
+ + + + + Introduction + + Welcome, gentle reader, to Rusty's Unreliable Guide to Linux + Kernel Hacking. This document describes the common routines and + general requirements for kernel code: its goal is to serve as a + primer for Linux kernel development for experienced C + programmers. I avoid implementation details: that's what the + code is for, and I ignore whole tracts of useful routines. + + + Before you read this, please understand that I never wanted to + write this document, being grossly under-qualified, but I always + wanted to read it, and this was the only way. I hope it will + grow into a compendium of best practice, common starting points + and random information. + + + + + The Players + + + At any time each of the CPUs in a system can be: + + + + + + not associated with any process, serving a hardware interrupt; + + + + + + not associated with any process, serving a softirq, tasklet or bh; + + + + + + running in kernel space, associated with a process; + + + + + + running a process in user space. + + + + + + There is a strict ordering between these: other than the last + category (userspace) each can only be pre-empted by those above. + For example, while a softirq is running on a CPU, no other + softirq will pre-empt it, but a hardware interrupt can. However, + any other CPUs in the system execute independently. + + + + We'll see a number of ways that the user context can block + interrupts, to become truly non-preemptable. + + + + User Context + + + User context is when you are coming in from a system call or + other trap: you can sleep, and you own the CPU (except for + interrupts) until you call schedule(). + In other words, user context (unlike userspace) is not pre-emptable. + + + + + You are always in user context on module load and unload, + and on operations on the block device layer. 
+ + + + + In user context, the current pointer (indicating + the task we are currently executing) is valid, and + in_interrupt() + (include/linux/interrupt.h) is false + . + + + + + Beware that if you have interrupts or bottom halves disabled + (see below), in_interrupt() will return a + false positive. + + + + + + Hardware Interrupts (Hard IRQs) + + + Timer ticks, network cards and + keyboard are examples of real + hardware which produce interrupts at any time. The kernel runs + interrupt handlers, which services the hardware. The kernel + guarantees that this handler is never re-entered: if another + interrupt arrives, it is queued (or dropped). Because it + disables interrupts, this handler has to be fast: frequently it + simply acknowledges the interrupt, marks a `software interrupt' + for execution and exits. + + + + You can tell you are in a hardware interrupt, because + in_irq() returns true. + + + + Beware that this will return a false positive if interrupts are disabled + (see below). + + + + + + Software Interrupt Context: Bottom Halves, Tasklets, softirqs + + + Whenever a system call is about to return to userspace, or a + hardware interrupt handler exits, any `software interrupts' + which are marked pending (usually by hardware interrupts) are + run (kernel/softirq.c). + + + + Much of the real interrupt handling work is done here. Early in + the transition to SMP, there were only `bottom + halves' (BHs), which didn't take advantage of multiple CPUs. Shortly + after we switched from wind-up computers made of match-sticks and snot, + we abandoned this limitation. + + + + include/linux/interrupt.h lists the + different BH's. No matter how many CPUs you have, no two BHs will run at + the same time. This made the transition to SMP simpler, but sucks hard for + scalable performance. A very important bottom half is the timer + BH (include/linux/timer.h): you + can register to have it call functions for you in a given length of time. 
+ + + + 2.3.43 introduced softirqs, and re-implemented the (now + deprecated) BHs underneath them. Softirqs are fully-SMP + versions of BHs: they can run on as many CPUs at once as + required. This means they need to deal with any races in shared + data using their own locks. A bitmask is used to keep track of + which are enabled, so the 32 available softirqs should not be + used up lightly. (Yes, people will + notice). + + + + tasklets (include/linux/interrupt.h) + are like softirqs, except they are dynamically-registrable (meaning you + can have as many as you want), and they also guarantee that any tasklet + will only run on one CPU at any time, although different tasklets can + run simultaneously (unlike different BHs). + + + + The name `tasklet' is misleading: they have nothing to do with `tasks', + and probably more to do with some bad vodka Alexey Kuznetsov had at the + time. + + + + + You can tell you are in a softirq (or bottom half, or tasklet) + using the in_softirq() macro + (include/linux/interrupt.h). + + + + Beware that this will return a false positive if a bh lock (see below) + is held. + + + + + + + Some Basic Rules + + + + No memory protection + + + If you corrupt memory, whether in user context or + interrupt context, the whole machine will crash. Are you + sure you can't do what you want in userspace? + + + + + + No floating point or MMX + + + The FPU context is not saved; even in user + context the FPU state probably won't + correspond with the current process: you would mess with some + user process' FPU state. If you really want + to do this, you would have to explicitly save/restore the full + FPU state (and avoid context switches). It + is generally a bad idea; use fixed point arithmetic first. + + + + + + A rigid stack limit + + + The kernel stack is about 6K in 2.2 (for most + architectures: it's about 14K on the Alpha), and shared + with interrupts so you can't use it all. 
Avoid deep + recursion and huge local arrays on the stack (allocate + them dynamically instead). + + + + + + The Linux kernel is portable + + + Let's keep it that way. Your code should be 64-bit clean, + and endian-independent. You should also minimize CPU + specific stuff, e.g. inline assembly should be cleanly + encapsulated and minimized to ease porting. Generally it + should be restricted to the architecture-dependent part of + the kernel tree. + + + + + + + + ioctls: Not writing a new system call + + + A system call generally looks like this: + + + +asmlinkage long sys_mycall(int arg) +{ + return 0; +} + + + + First, in most cases you don't want to create a new system call. + You create a character device and implement an appropriate ioctl + for it. This is much more flexible than system calls, doesn't have + to be entered in every architecture's + include/asm/unistd.h and + arch/kernel/entry.S file, and is much more + likely to be accepted by Linus. + + + + If all your routine does is read or write some parameter, consider + implementing a sysctl interface instead. + + + + Inside the ioctl you're in user context to a process. When an + error occurs you return a negated errno (see + include/linux/errno.h), + otherwise you return 0. + + + + After you have slept you should check if a signal occurred: the + Unix/Linux way of handling signals is to temporarily exit the + system call with the -ERESTARTSYS error. The + system call entry code will switch back to user context, process + the signal handler and then your system call will be restarted + (unless the user disabled that). So you should be prepared to + process the restart, e.g. if you're in the middle of manipulating + some data structure. + + + +if (signal_pending(current)) + return -ERESTARTSYS; + + + + If you're doing longer computations: first think userspace. If you + really want to do it in kernel you should + regularly check if you need to give up the CPU (remember there is + cooperative multitasking per CPU).
Idiom: + + + +cond_resched(); /* Will sleep */ + + + + A short note on interface design: the UNIX system call motto is + "Provide mechanism not policy". + + + + + Recipes for Deadlock + + + You cannot call any routines which may sleep, unless: + + + + + You are in user context. + + + + + + You do not own any spinlocks. + + + + + + You have interrupts enabled (actually, Andi Kleen says + that the scheduling code will enable them for you, but + that's probably not what you wanted). + + + + + + Note that some functions may sleep implicitly: common ones are + the user space access functions (*_user) and memory allocation + functions without GFP_ATOMIC. + + + + You will eventually lock up your box if you break these rules. + + + + Really. + + + + + Common Routines + + + + <function>printk()</function> + <filename class="headerfile">include/linux/kernel.h</filename> + + + + printk() feeds kernel messages to the + console, dmesg, and the syslog daemon. It is useful for debugging + and reporting errors, and can be used inside interrupt context, + but use with caution: a machine which has its console flooded with + printk messages is unusable. It uses a format string mostly + compatible with ANSI C printf, and C string concatenation to give + it a first "priority" argument: + + + +printk(KERN_INFO "i = %u\n", i); + + + + See include/linux/kernel.h; + for other KERN_ values; these are interpreted by syslog as the + level. Special case: for printing an IP address use + + + +__u32 ipaddress; +printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); + + + + printk() internally uses a 1K buffer and does + not catch overruns. Make sure that will be enough. + + + + + You will know when you are a real kernel hacker + when you start typoing printf as printk in your user programs :) + + + + + + + + Another sidenote: the original Unix Version 6 sources had a + comment on top of its printf function: "Printf should not be + used for chit-chat". You should follow that advice. 
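To make the IP address special case above concrete: NIPQUAD() simply expands to the four bytes of the address, in the order they sit in memory, as four separate format arguments. Here is a userspace sketch of the same trick. MY_NIPQUAD, format_ip() and demo_address() are made-up names, snprintf() stands in for printk(), and the address is assumed to be held in network byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace look-alike of the kernel's NIPQUAD(): expands to
 * four comma-separated arguments, one per byte of the address as stored. */
#define MY_NIPQUAD(addr) \
	((const unsigned char *)&(addr))[0], \
	((const unsigned char *)&(addr))[1], \
	((const unsigned char *)&(addr))[2], \
	((const unsigned char *)&(addr))[3]

/* Format a network-byte-order IPv4 address into buf; returns buf. */
static char *format_ip(char *buf, size_t len, uint32_t ipaddress)
{
	assert(len > 0);
	/* Bounded formatting: snprintf() truncates rather than overrunning. */
	snprintf(buf, len, "%d.%d.%d.%d", MY_NIPQUAD(ipaddress));
	return buf;
}

/* Build 192.168.0.1 byte by byte so the demo does not depend on the
 * endianness of the machine it runs on. */
static uint32_t demo_address(void)
{
	const unsigned char bytes[4] = { 192, 168, 0, 1 };
	uint32_t ipaddress;

	memcpy(&ipaddress, bytes, sizeof(ipaddress));
	return ipaddress;
}
```

Note that the macro takes the address of its argument, so it only works on a variable, not on an arbitrary expression.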
+ + + + + + + <function>copy_[to/from]_user()</function> + / + <function>get_user()</function> + / + <function>put_user()</function> + <filename class="headerfile">include/asm/uaccess.h</filename> + + + + [SLEEPS] + + + + put_user() and get_user() + are used to get and put single values (such as an int, char, or + long) from and to userspace. A pointer into userspace should + never be simply dereferenced: data should be copied using these + routines. Both return -EFAULT or 0. + + + copy_to_user() and + copy_from_user() are more general: they copy + an arbitrary amount of data to and from userspace. + + + Unlike put_user() and + get_user(), they return the amount of + uncopied data (ie. 0 still means + success). + + + [Yes, this moronic interface makes me cringe. Please submit a + patch and become my hero --RR.] + + + The functions may sleep implicitly. This should never be called + outside user context (it makes no sense), with interrupts + disabled, or a spinlock held. + + + + + <function>kmalloc()</function>/<function>kfree()</function> + <filename class="headerfile">include/linux/slab.h</filename> + + + [MAY SLEEP: SEE BELOW] + + + + These routines are used to dynamically request pointer-aligned + chunks of memory, like malloc and free do in userspace, but + kmalloc() takes an extra flag word. + Important values: + + + + + + + GFP_KERNEL + + + + + May sleep and swap to free memory. Only allowed in user + context, but is the most reliable way to allocate memory. + + + + + + + + GFP_ATOMIC + + + + + Don't sleep. Less reliable than GFP_KERNEL, + but may be called from interrupt context. You should + really have a good out-of-memory + error-handling strategy. + + + + + + + + GFP_DMA + + + + + Allocate ISA DMA lower than 16MB. If you don't know what that + is you don't need it. Very unreliable. + + + + + + + If you see a kmem_grow: Called nonatomically from int + warning message you called a memory allocation function + from interrupt context without GFP_ATOMIC. 
+ You should really fix that. Run, don't walk. + + + + If you are allocating at least PAGE_SIZE + (include/asm/page.h) bytes, + consider using __get_free_pages() + + (include/linux/mm.h). It + takes an order argument (0 for page sized, 1 for double page, 2 + for four pages etc.) and the same memory priority flag word as + above. + + + + If you are allocating more than a page worth of bytes you can use + vmalloc(). It'll allocate virtual memory in + the kernel map. This block is not contiguous in physical memory, + but the MMU makes it look like it is for you + (so it'll only look contiguous to the CPUs, not to external device + drivers). If you really need large physically contiguous memory + for some weird device, you have a problem: it is poorly supported + in Linux because after some time memory fragmentation in a running + kernel makes it hard. The best way is to allocate the block early + in the boot process via the alloc_bootmem() + routine. + + + + Before inventing your own cache of often-used objects consider + using a slab cache in + include/linux/slab.h + + + + + <function>current</function> + <filename class="headerfile">include/asm/current.h</filename> + + + This global variable (really a macro) contains a pointer to + the current task structure, so is only valid in user context. + For example, when a process makes a system call, this will + point to the task structure of the calling process. It is + not NULL in interrupt context. + + + + + <function>udelay()</function>/<function>mdelay()</function> + <filename class="headerfile">include/asm/delay.h</filename> + <filename class="headerfile">include/linux/delay.h</filename> + + + + The udelay() function can be used for small pauses. + Do not use large values with udelay() as you risk + overflow - the helper function mdelay() is useful + here, or even consider schedule_timeout(). 
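Why mdelay() helps where a big udelay() does not: rather than scaling one huge microsecond count, a millisecond delay can be built by calling the microsecond delay repeatedly with a value known to be small. The loop below is a rough userspace sketch of that approach, under the assumption that 1000 microseconds is a safe single step. fake_udelay() only records what it was asked to do instead of busy-waiting, and every name here is invented for illustration; the real definitions live in the delay headers named above.

```c
#include <assert.h>

static unsigned long total_usecs;	/* sum of all requested delays */
static unsigned long longest_usecs;	/* largest single request seen */

/* Stand-in for udelay(): records the request instead of spinning. */
static void fake_udelay(unsigned long usecs)
{
	total_usecs += usecs;
	if (usecs > longest_usecs)
		longest_usecs = usecs;
}

/* Sketch of a millisecond delay built from small, overflow-safe steps. */
static void sketch_mdelay(unsigned long msecs)
{
	while (msecs--)
		fake_udelay(1000);
}

/* Requests a 250ms delay and returns the largest value ever handed
 * to fake_udelay() while doing so. */
static unsigned long demo_delay(void)
{
	total_usecs = 0;
	longest_usecs = 0;
	sketch_mdelay(250);
	assert(total_usecs == 250000UL);
	return longest_usecs;
}
```

However long the total delay, no single call ever sees more than 1000 microseconds, which is the property that keeps the per-call arithmetic away from overflow.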
+ + + + + <function>cpu_to_be32()</function>/<function>be32_to_cpu()</function>/<function>cpu_to_le32()</function>/<function>le32_to_cpu()</function> + <filename class="headerfile">include/asm/byteorder.h</filename> + + + + The cpu_to_be32() family (where the "32" can + be replaced by 64 or 16, and the "be" can be replaced by "le") are + the general way to do endian conversions in the kernel: they + return the converted value. All variations supply the reverse as + well: be32_to_cpu(), etc. + + + + There are two major variations of these functions: the pointer + variation, such as cpu_to_be32p(), which take + a pointer to the given type, and return the converted value. The + other variation is the "in-situ" family, such as + cpu_to_be32s(), which convert value referred + to by the pointer, and return void. + + + + + <function>local_irq_save()</function>/<function>local_irq_restore()</function> + <filename class="headerfile">include/asm/system.h</filename> + + + + These routines disable hard interrupts on the local CPU, and + restore them. They are reentrant; saving the previous state in + their one unsigned long flags argument. If you + know that interrupts are enabled, you can simply use + local_irq_disable() and + local_irq_enable(). + + + + + <function>local_bh_disable()</function>/<function>local_bh_enable()</function> + <filename class="headerfile">include/linux/interrupt.h</filename> + + + These routines disable soft interrupts on the local CPU, and + restore them. They are reentrant; if soft interrupts were + disabled before, they will still be disabled after this pair + of functions has been called. They prevent softirqs, tasklets + and bottom halves from running on the current CPU. + + + + + <function>smp_processor_id</function>() + <filename class="headerfile">include/asm/smp.h</filename> + + + smp_processor_id() returns the current + processor number, between 0 and NR_CPUS (the + maximum number of CPUs supported by Linux, currently 32). 
These + values are not necessarily continuous. + + + + + <type>__init</type>/<type>__exit</type>/<type>__initdata</type> + <filename class="headerfile">include/linux/init.h</filename> + + + After boot, the kernel frees up a special section; functions + marked with __init and data structures marked with + __initdata are dropped after boot is complete (within + modules this directive is currently ignored). __exit + is used to declare a function which is only required on exit: the + function will be dropped if this file is not compiled as a module. + See the header file for use. Note that it makes no sense for a function + marked with __init to be exported to modules with + EXPORT_SYMBOL() - this will break. + + + Static data structures marked as __initdata must be initialised + (as opposed to ordinary static data which is zeroed BSS) and cannot be + const. + + + + + + <function>__initcall()</function>/<function>module_init()</function> + <filename class="headerfile">include/linux/init.h</filename> + + Many parts of the kernel are well served as a module + (dynamically-loadable parts of the kernel). Using the + module_init() and + module_exit() macros it is easy to write code + without #ifdefs which can operate both as a module or built into + the kernel. + + + + The module_init() macro defines which + function is to be called at module insertion time (if the file is + compiled as a module), or at boot time: if the file is not + compiled as a module the module_init() macro + becomes equivalent to __initcall(), which + through linker magic ensures that the function is called on boot. + + + + The function can return a negative error number to cause + module loading to fail (unfortunately, this has no effect if + the module is compiled into the kernel). For modules, this is + called in user context, with interrupts enabled, and the + kernel lock held, so it can sleep. 
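The "linker magic" mentioned above can be imitated in userspace, which may help demystify it. The idea is to drop a pointer to each init function into a dedicated section and have the startup code walk that section. This sketch assumes GCC (or clang) with GNU ld on an ELF system, where the linker provides __start_ and __stop_ symbols for any section whose name is a valid C identifier; the kernel itself uses its own linker script instead. MY_INITCALL, the my_initcalls section name and run_initcalls() are all made up for the example.

```c
#include <assert.h>

typedef int (*initcall_t)(void);

/* Hypothetical look-alike of __initcall(): place a pointer to fn in a
 * dedicated section.  "my_initcalls" is a made-up section name. */
#define MY_INITCALL(fn) \
	static initcall_t my_initcall_##fn \
	__attribute__((used, section("my_initcalls"))) = fn

/* Provided by GNU ld for sections whose names are valid C identifiers. */
extern initcall_t __start_my_initcalls[];
extern initcall_t __stop_my_initcalls[];

static int initialised;	/* bitmask of init functions that have run */

static int foo_init(void) { initialised |= 1; return 0; }
static int bar_init(void) { initialised |= 2; return 0; }

MY_INITCALL(foo_init);
MY_INITCALL(bar_init);

/* What "boot" does: walk the section and call everything found there.
 * Returns the number of initcalls that reported success (returned 0). */
static int run_initcalls(void)
{
	initcall_t *call;
	int ok = 0;

	for (call = __start_my_initcalls; call < __stop_my_initcalls; call++)
		if ((*call)() == 0)
			ok++;
	return ok;
}
```

Nothing in the source ever names foo_init() or bar_init() in a call: registering a function is purely a matter of where the linker puts its pointer, which is why no #ifdefs or central tables are needed.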
+ + + + + <function>module_exit()</function> + <filename class="headerfile">include/linux/init.h</filename> + + + This macro defines the function to be called at module removal + time (or never, in the case of the file compiled into the + kernel). It will only be called if the module usage count has + reached zero. This function can also sleep, but cannot fail: + everything must be cleaned up by the time it returns. + + + + + + + + Wait Queues + <filename class="headerfile">include/linux/wait.h</filename> + + + [SLEEPS] + + + + A wait queue is used to wait for someone to wake you up when a + certain condition is true. They must be used carefully to ensure + there is no race condition. You declare a + wait_queue_head_t, and then processes which want to + wait for that condition declare a wait_queue_t + referring to themselves, and place that in the queue. + + + + Declaring + + + You declare a wait_queue_head_t using the + DECLARE_WAIT_QUEUE_HEAD() macro, or using the + init_waitqueue_head() routine in your + initialization code. + + + + + Queuing + + + Placing yourself in the waitqueue is fairly complex, because you + must put yourself in the queue before checking the condition. + There is a macro to do this: + wait_event_interruptible() + + include/linux/sched.h The + first argument is the wait queue head, and the second is an + expression which is evaluated; the macro returns + 0 when this expression is true, or + -ERESTARTSYS if a signal is received. + The wait_event() version ignores signals. + + + Do not use the sleep_on() function family - + it is very easy to accidentally introduce races; almost certainly + one of the wait_event() family will do, or a + loop around schedule_timeout(). If you choose + to loop around schedule_timeout() remember + you must set the task state (with + set_current_state()) on each iteration to avoid + busy-looping. 
+ + + + + + Waking Up Queued Tasks + + + Call wake_up() + + include/linux/sched.h;, + which will wake up every process in the queue. The exception is + if one has TASK_EXCLUSIVE set, in which case + the remainder of the queue will not be woken. + + + + + + Atomic Operations + + + Certain operations are guaranteed atomic on all platforms. The + first class of operations work on atomic_t + + include/asm/atomic.h; this + contains a signed integer (at least 24 bits long), and you must use + these functions to manipulate or read atomic_t variables. + atomic_read() and + atomic_set() get and set the counter, + atomic_add(), + atomic_sub(), + atomic_inc(), + atomic_dec(), and + atomic_dec_and_test() (returns + true if it was decremented to zero). + + + + Yes. It returns true (i.e. != 0) if the + atomic variable is zero. + + + + Note that these functions are slower than normal arithmetic, and + so should not be used unnecessarily. On some platforms they + are much slower, like 32-bit Sparc where they use a spinlock. + + + + The second class of atomic operations is atomic bit operations on a + long, defined in + + include/linux/bitops.h. These + operations generally take a pointer to the bit pattern, and a bit + number: 0 is the least significant bit. + set_bit(), clear_bit() + and change_bit() set, clear, and flip the + given bit. test_and_set_bit(), + test_and_clear_bit() and + test_and_change_bit() do the same thing, + except return true if the bit was previously set; these are + particularly useful for very simple locking. + + + + It is possible to call these operations with bit indices greater + than BITS_PER_LONG. The resulting behavior is strange on big-endian + platforms though so it is a good idea not to do this. + + + + Note that the order of bits depends on the architecture, and in + particular, the bitfield passed to these operations must be at + least as large as a long. + + + + + Symbols + + + Within the kernel proper, the normal linking rules apply + (ie. 
unless a symbol is declared to be file scope with the + static keyword, it can be used anywhere in the + kernel). However, for modules, a special exported symbol table is + kept which limits the entry points to the kernel proper. Modules + can also export symbols. + + + + <function>EXPORT_SYMBOL()</function> + <filename class="headerfile">include/linux/module.h</filename> + + + This is the classic method of exporting a symbol, and it works + for both modules and non-modules. In the kernel all these + declarations are often bundled into a single file to help + genksyms (which searches source files for these declarations). + See the comment on genksyms and Makefiles below. + + + + + <function>EXPORT_SYMBOL_GPL()</function> + <filename class="headerfile">include/linux/module.h</filename> + + + Similar to EXPORT_SYMBOL() except that the + symbols exported by EXPORT_SYMBOL_GPL() can + only be seen by modules with a + MODULE_LICENSE() that specifies a GPL + compatible license. + + + + + + Routines and Conventions + + + Double-linked lists + <filename class="headerfile">include/linux/list.h</filename> + + + There are three sets of linked-list routines in the kernel + headers, but this one seems to be winning out (and Linus has + used it). If you don't have some particular pressing need for + a single list, it's a good choice. In fact, I don't care + whether it's a good choice or not, just use it so we can get + rid of the others. + + + + + Return Conventions + + + For code called in user context, it's very common to defy C + convention, and return 0 for success, + and a negative error number + (eg. -EFAULT) for failure. This can be + unintuitive at first, but it's fairly widespread in the networking + code, for example. + + + + The filesystem code uses ERR_PTR() + + include/linux/fs.h; to + encode a negative error number into a pointer, and + IS_ERR() and PTR_ERR() + to get it back out again: avoids a separate pointer parameter for + the error number. 
Icky, but in a good way. + + + + + Breaking Compilation + + + Linus and the other developers sometimes change function or + structure names in development kernels; this is not done just to + keep everyone on their toes: it reflects a fundamental change + (eg. can no longer be called with interrupts on, or does extra + checks, or doesn't do checks which were caught before). Usually + this is accompanied by a fairly complete note to the linux-kernel + mailing list; search the archive. Simply doing a global replace + on the file usually makes things worse. + + + + + Initializing structure members + + + The preferred method of initializing structures is to use + designated initialisers, as defined by ISO C99, eg: + + +static struct block_device_operations opt_fops = { + .open = opt_open, + .release = opt_release, + .ioctl = opt_ioctl, + .check_media_change = opt_media_change, +}; + + + This makes it easy to grep for, and makes it clear which + structure fields are set. You should do this because it looks + cool. + + + + + GNU Extensions + + + GNU Extensions are explicitly allowed in the Linux kernel. + Note that some of the more complex ones are not very well + supported, due to lack of general use, but the following are + considered standard (see the GCC info page section "C + Extensions" for more details - Yes, really the info page, the + man page is only a short summary of the stuff in info): + + + + + Inline functions + + + + + Statement expressions (ie. the ({ and }) constructs). 
+ + + + + Declaring attributes of a function / variable / type + (__attribute__) + + + + + typeof + + + + + Zero length arrays + + + + + Macro varargs + + + + + Arithmetic on void pointers + + + + + Non-Constant initializers + + + + + Assembler Instructions (not outside arch/ and include/asm/) + + + + + Function names as strings (__FUNCTION__) + + + + + __builtin_constant_p() + + + + + + Be wary when using long long in the kernel: the code gcc generates for + it is horrible and, worse, division and multiplication do not work + on i386 because the GCC runtime functions for them are missing from + the kernel environment. + + + + + + + C++ + + + Using C++ in the kernel is usually a bad idea, because the + kernel does not provide the necessary runtime environment + and the include files are not tested for it. It is still + possible, but not recommended. If you really want to do + this, forget about exceptions at least. + + + + + #if + + + It is generally considered cleaner to use macros in header files + (or at the top of .c files) to abstract away functions rather than + using `#if' pre-processor statements throughout the source code. + + + + + + Putting Your Stuff in the Kernel + + + In order to get your stuff into shape for official inclusion, or + even to make a neat patch, there's administrative work to be + done: + + + + + Figure out whose pond you've been pissing in. Look at the top of + the source files, inside the MAINTAINERS + file, and last of all in the CREDITS file. + You should coordinate with this person to make sure you're not + duplicating effort, or trying something that's already been + rejected. + + + + Make sure you put your name and EMail address at the top of + any files you create or mangle significantly. This is the + first place people will look when they find a bug, or when + they want to make a change. + + + + + + Usually you want a configuration option for your kernel hack.
+ Edit Config.in in the appropriate directory + (but under arch/ it's called + config.in). The Config Language used is not + bash, even though it looks like bash; the safe way is to use only + the constructs that you already see in + Config.in files (see + Documentation/kbuild/kconfig-language.txt). + It's good to run "make xconfig" at least once to test (because + it's the only one with a static parser). + + + + Variables which can be Y or N use bool followed by a + tagline and the config define name (which must start with + CONFIG_). The tristate function is the same, but + allows the answer M (which defines + CONFIG_FOO_MODULE in your source, instead of + CONFIG_FOO) if CONFIG_MODULES + is enabled. + + + + You may well want to make your CONFIG option only visible if + CONFIG_EXPERIMENTAL is enabled: this serves as a + warning to users. There are many other fancy things you can do: see + the various Config.in files for ideas. + + + + + + Edit the Makefile: the CONFIG variables are + exported here so you can conditionalize compilation with `ifeq'. + If your file exports symbols then add the names to + export-objs so that genksyms will find them. + + + There is a restriction on the kernel build system that objects + which export symbols must have globally unique names. + If your object does not have a globally unique name then the + standard fix is to move the + EXPORT_SYMBOL() statements to their own + object with a unique name. + This is why several systems have separate exporting objects, + usually suffixed with ksyms. + + + + + + + + Document your option in Documentation/Configure.help. Mention + incompatibilities and issues here. Definitely + end your description with "if in doubt, say N" + (or, occasionally, `Y'); this is for people who have no + idea what you are talking about. + + + + + + Put yourself in CREDITS if you've done + something noteworthy, usually beyond a single file (your name + should be at the top of the source files anyway).
+ MAINTAINERS means you want to be consulted + when changes are made to a subsystem, and hear about bugs; it + implies a more-than-passing commitment to some part of the code. + + + + + + Finally, don't forget to read Documentation/SubmittingPatches + and possibly Documentation/SubmittingDrivers. + + + + + + + Kernel Cantrips + + + Some favorites from browsing the source. Feel free to add to this + list. + + + + include/linux/brlock.h: + + +extern inline void br_read_lock (enum brlock_indices idx) +{ + /* + * This causes a link-time bug message if an + * invalid index is used: + */ + if (idx >= __BR_END) + __br_lock_usage_bug(); + + read_lock(&__brlock_array[smp_processor_id()][idx]); +} + + + + include/linux/fs.h: + + +/* + * Kernel pointers have redundant information, so we can use a + * scheme where we can return either an error code or a dentry + * pointer with the same return value. + * + * This should be a per-architecture thing, to allow different + * error and pointer decisions. + */ + #define ERR_PTR(err) ((void *)((long)(err))) + #define PTR_ERR(ptr) ((long)(ptr)) + #define IS_ERR(ptr) ((unsigned long)(ptr) > (unsigned long)(-1000)) + + + + include/asm-i386/uaccess.h: + + + +#define copy_to_user(to,from,n) \ + (__builtin_constant_p(n) ? \ + __constant_copy_to_user((to),(from),(n)) : \ + __generic_copy_to_user((to),(from),(n))) + + + + arch/sparc/kernel/head.S: + + + +/* + * Sun people can't spell worth damn. "compatability" indeed. + * At least we *know* we can't spell, and use a spell-checker. + */ + +/* Uh, actually Linus it is I who cannot spell. Too much murky + * Sparc assembly will do this to ya. + */ +C_LABEL(cputypvar): + .asciz "compatability" + +/* Tested on SS-5, SS-10. Probably someone at Sun applied a spell-checker. */ + .align 4 +C_LABEL(cputypvar_sun4m): + .asciz "compatible" + + + + arch/sparc/lib/checksum.S: + + + + /* Sun, you just can't beat me, you just can't. Stop trying, + * give up. 
I'm serious, I am going to kick the living shit + * out of you, game over, lights out. + */ + + + + + Thanks + + + Thanks to Andi Kleen for the idea, answering my questions, fixing + my mistakes, filling content, etc. Philipp Rumpf for more spelling + and clarity fixes, and some excellent non-obvious points. Werner + Almesberger for giving me a great summary of + disable_irq(), and Jes Sorensen and Andrea + Arcangeli added caveats. Michael Elizabeth Chastain for checking + and adding to the Configure section. Telsa Gwynne for teaching me DocBook. + + +
+ diff --git a/Documentation/DocBook/kernel-locking.tmpl b/Documentation/DocBook/kernel-locking.tmpl new file mode 100644 index 000000000000..90dc2de8e0af --- /dev/null +++ b/Documentation/DocBook/kernel-locking.tmpl @@ -0,0 +1,2088 @@ + + + + + + Unreliable Guide To Locking + + + + Rusty + Russell + +
+ rusty@rustcorp.com.au +
+
+
+
+ + + 2003 + Rusty Russell + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + +
+ + + + Introduction + + Welcome, to Rusty's Remarkably Unreliable Guide to Kernel + Locking issues. This document describes the locking systems in + the Linux Kernel in 2.6. + + + With the wide availability of HyperThreading, and preemption in the Linux + Kernel, everyone hacking on the kernel needs to know the + fundamentals of concurrency and locking for + SMP. + + + + + The Problem With Concurrency + + (Skip this if you know what a Race Condition is). + + + In a normal program, you can increment a counter like so: + + + very_important_count++; + + + + This is what they would expect to happen: + + + + Expected Results + + + + + + Instance 1 + Instance 2 + + + + + + read very_important_count (5) + + + + add 1 (6) + + + + write very_important_count (6) + + + + + read very_important_count (6) + + + + add 1 (7) + + + + write very_important_count (7) + + + + +
+ + + This is what might happen: + + + + Possible Results + + + + + Instance 1 + Instance 2 + + + + + + read very_important_count (5) + + + + + read very_important_count (5) + + + add 1 (6) + + + + + add 1 (6) + + + write very_important_count (6) + + + + + write very_important_count (6) + + + +
+ + + Race Conditions and Critical Regions + + This overlap, where the result depends on the + relative timing of multiple tasks, is called a race condition. + The piece of code containing the concurrency issue is called a + critical region. Especially since Linux started running + on SMP machines, race conditions have become one of the major issues in kernel + design and implementation. + + + Preemption can have the same effect, even if there is only one + CPU: by preempting one task during the critical region, we have + exactly the same race condition. In this case the thread which + preempts might run the critical region itself. + + + The solution is to recognize when these simultaneous accesses + occur, and use locks to make sure that only one instance can + enter the critical region at any time. There are many + friendly primitives in the Linux kernel to help you do this. + And then there are the unfriendly primitives, but I'll pretend + they don't exist. + + +
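As a rough userspace illustration of the point above (an analogy only: it uses POSIX threads rather than any kernel primitive, and the names count_lock, counter_thread and run_counter_threads are made up for this sketch), the counter from the earlier tables can be protected by a lock so that only one instance is in the critical region at a time:

```c
#include <pthread.h>

#define NUM_THREADS		4
#define INCREMENTS_PER_THREAD	100000

static long very_important_count;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread plays the part of one "instance" from the tables above. */
static void *counter_thread(void *unused)
{
	int i;

	(void)unused;
	for (i = 0; i < INCREMENTS_PER_THREAD; i++) {
		pthread_mutex_lock(&count_lock);
		very_important_count++;		/* the critical region */
		pthread_mutex_unlock(&count_lock);
	}
	return NULL;
}

/*
 * Start the threads, wait for them, and return the final count,
 * or -1 if a thread could not be created.
 */
long run_counter_threads(void)
{
	pthread_t threads[NUM_THREADS];
	int started, i, failed = 0;

	very_important_count = 0;
	for (started = 0; started < NUM_THREADS; started++) {
		if (pthread_create(&threads[started], NULL,
				   counter_thread, NULL) != 0) {
			failed = 1;
			break;
		}
	}
	/* Only join the threads that were actually started. */
	for (i = 0; i < started; i++)
		pthread_join(threads[i], NULL);

	return failed ? -1 : very_important_count;
}
```

With the lock in place, run_counter_threads() should come back with NUM_THREADS * INCREMENTS_PER_THREAD. Take the lock and unlock calls out and the total may come up short, as in the second table.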
+ + + Locking in the Linux Kernel + + + If I could give you one piece of advice: never sleep with anyone + crazier than yourself. But if I had to give you advice on + locking: keep it simple. + + + + Be reluctant to introduce new locks. + + + + Strangely enough, this last one is the exact reverse of my advice when + you have slept with someone crazier than yourself. + And you should think about getting a big dog. + + + + Two Main Types of Kernel Locks: Spinlocks and Semaphores + + + There are two main types of kernel locks. The fundamental type + is the spinlock + (include/asm/spinlock.h), + which is a very simple single-holder lock: if you can't get the + spinlock, you keep trying (spinning) until you can. Spinlocks are + very small and fast, and can be used anywhere. + + + The second type is a semaphore + (include/asm/semaphore.h): it + can have more than one holder at any time (the number decided at + initialization time), although it is most commonly used as a + single-holder lock (a mutex). If you can't get a semaphore, + your task will put itself on the queue, and be woken up when the + semaphore is released. This means the CPU will do something + else while you are waiting, but there are many cases when you + simply can't sleep (see ), and so + have to use a spinlock instead. + + + Neither type of lock is recursive: see + . + + + + + Locks and Uniprocessor Kernels + + + For kernels compiled without CONFIG_SMP, and + without CONFIG_PREEMPT spinlocks do not exist at + all. This is an excellent design decision: when no-one else can + run at the same time, there is no reason to have a lock. + + + + If the kernel is compiled without CONFIG_SMP, + but CONFIG_PREEMPT is set, then spinlocks + simply disable preemption, which is sufficient to prevent any + races. For most purposes, we can think of preemption as + equivalent to SMP, and not worry about it separately. 
+ + + + You should always test your locking code with CONFIG_SMP + and CONFIG_PREEMPT enabled, even if you don't have an SMP test box, because it + will still catch some kinds of locking bugs. + + + + Semaphores still exist, because they are required for + synchronization between user + contexts, as we will see below. + + + + + Locking Only In User Context + + + If you have a data structure which is only ever accessed from + user context, then you can use a simple semaphore + (linux/asm/semaphore.h) to protect it. This + is the most trivial case: you initialize the semaphore to the number + of resources available (usually 1), and call + down_interruptible() to grab the semaphore, and + up() to release it. There is also a + down(), which should be avoided, because it + will not return if a signal is received. + + + + Example: linux/net/core/netfilter.c allows + registration of new setsockopt() and + getsockopt() calls, with + nf_register_sockopt(). Registration and + de-registration are only done on module load and unload (and boot + time, where there is no concurrency), and the list of registrations + is only consulted for an unknown setsockopt() + or getsockopt() system call. The + nf_sockopt_mutex is perfect to protect this, + especially since the setsockopt and getsockopt calls may well + sleep. + + + + + Locking Between User Context and Softirqs + + + If a softirq shares + data with user context, you have two problems. Firstly, the current + user context can be interrupted by a softirq, and secondly, the + critical region could be entered from another CPU. This is where + spin_lock_bh() + (include/linux/spinlock.h) is + used. It disables softirqs on that CPU, then grabs the lock. + spin_unlock_bh() does the reverse. (The + '_bh' suffix is a historical reference to "Bottom Halves", the + old name for software interrupts. It should really be + called spin_lock_softirq()' in a perfect world). 
+ + + + Note that you can also use spin_lock_irq() + or spin_lock_irqsave() here, which stop + hardware interrupts as well: see . + + + + This works perfectly for UP + as well: the spin lock vanishes, and this macro + simply becomes local_bh_disable() + (include/linux/interrupt.h), which + protects you from the softirq being run. + + + + + Locking Between User Context and Tasklets + + + This is exactly the same as above, because tasklets are actually run + from a softirq. + + + + + Locking Between User Context and Timers + + + This, too, is exactly the same as above, because timers are actually run from + a softirq. From a locking point of view, tasklets and timers + are identical. + + + + + Locking Between Tasklets/Timers + + + Sometimes a tasklet or timer might want to share data with + another tasklet or timer. + + + + The Same Tasklet/Timer + + Since a tasklet is never run on two CPUs at once, you don't + need to worry about your tasklet being reentrant (running + twice at once), even on SMP. + + + + + Different Tasklets/Timers + + If another tasklet/timer wants + to share data with your tasklet or timer , you will both need to use + spin_lock() and + spin_unlock() calls. + spin_lock_bh() is + unnecessary here, as you are already in a tasklet, and + none will be run on the same CPU. + + + + + + Locking Between Softirqs + + + Often a softirq might + want to share data with itself or a tasklet/timer. + + + + The Same Softirq + + + The same softirq can run on the other CPUs: you can use a + per-CPU array (see ) for better + performance. If you're going so far as to use a softirq, + you probably care about scalable performance enough + to justify the extra complexity. + + + + You'll need to use spin_lock() and + spin_unlock() for shared data. 
+ + + + + Different Softirqs + + + You'll need to use spin_lock() and + spin_unlock() for shared data, whether it + be a timer, tasklet, different softirq or the same or another + softirq: any of them could be running on a different CPU. + + + + + + + Hard IRQ Context + + + Hardware interrupts usually communicate with a + tasklet or softirq. Frequently this involves putting work in a + queue, which the softirq will take out. + + + + Locking Between Hard IRQ and Softirqs/Tasklets + + + If a hardware irq handler shares data with a softirq, you have + two concerns. Firstly, the softirq processing can be + interrupted by a hardware interrupt, and secondly, the + critical region could be entered by a hardware interrupt on + another CPU. This is where spin_lock_irq() is + used. It is defined to disable interrupts on that CPU, then grab + the lock. spin_unlock_irq() does the reverse. + + + + The irq handler does not need to use + spin_lock_irq(), because the softirq cannot + run while the irq handler is running: it can use + spin_lock(), which is slightly faster. The + only exception would be if a different hardware irq handler uses + the same lock: spin_lock_irq() will stop + that from interrupting us. + + + + This works perfectly for UP as well: the spin lock vanishes, + and this macro simply becomes local_irq_disable() + (include/asm/smp.h), which + protects you from the softirq/tasklet/BH being run. + + + + spin_lock_irqsave() + (include/linux/spinlock.h) is a variant + which saves whether interrupts were on or off in a flags word, + which is passed to spin_unlock_irqrestore(). This + means that the same code can be used inside a hard irq handler (where + interrupts are already off) and in softirqs (where the irq + disabling is required). + + + + Note that softirqs (and hence tasklets and timers) are run on + return from hardware interrupts, so + spin_lock_irq() also stops these. In that + sense, spin_lock_irqsave() is the most + general and powerful locking function.
+ + + + + Locking Between Two Hard IRQ Handlers + + It is rare to have to share data between two IRQ handlers, but + if you do, spin_lock_irqsave() should be + used: it is architecture-specific whether all interrupts are + disabled inside irq handlers themselves. + + + + + + + Cheat Sheet For Locking + + Pete Zaitcev gives the following summary: + + + + + If you are in a process context (any syscall) and want to + lock other processes out, use a semaphore. You can take a semaphore + and sleep (copy_from_user*() or + kmalloc(x,GFP_KERNEL)). + + + + + Otherwise (== data can be touched in an interrupt), use + spin_lock_irqsave() and + spin_unlock_irqrestore(). + + + + + Avoid holding a spinlock for more than 5 lines of code and + across any function call (except accessors like + readb). + + + + + + Table of Minimum Requirements + + The following table lists the minimum + locking requirements between various contexts. In some cases, + the same context can only be running on one CPU at a time, so + no locking is required for that context (eg. a particular + thread can only run on one CPU at a time, but if it needs + to share data with another thread, locking is required). + + + Remember the advice above: you can always use + spin_lock_irqsave(), which is a superset + of all other spinlock primitives.
+ + +Table of Locking Requirements + + + + +IRQ Handler A +IRQ Handler B +Softirq A +Softirq B +Tasklet A +Tasklet B +Timer A +Timer B +User Context A +User Context B + + + +IRQ Handler A +None + + + +IRQ Handler B +spin_lock_irqsave +None + + + +Softirq A +spin_lock_irq +spin_lock_irq +spin_lock + + + +Softirq B +spin_lock_irq +spin_lock_irq +spin_lock +spin_lock + + + +Tasklet A +spin_lock_irq +spin_lock_irq +spin_lock +spin_lock +None + + + +Tasklet B +spin_lock_irq +spin_lock_irq +spin_lock +spin_lock +spin_lock +None + + + +Timer A +spin_lock_irq +spin_lock_irq +spin_lock +spin_lock +spin_lock +spin_lock +None + + + +Timer B +spin_lock_irq +spin_lock_irq +spin_lock +spin_lock +spin_lock +spin_lock +spin_lock +None + + + +User Context A +spin_lock_irq +spin_lock_irq +spin_lock_bh +spin_lock_bh +spin_lock_bh +spin_lock_bh +spin_lock_bh +spin_lock_bh +None + + + +User Context B +spin_lock_irq +spin_lock_irq +spin_lock_bh +spin_lock_bh +spin_lock_bh +spin_lock_bh +spin_lock_bh +spin_lock_bh +down_interruptible +None + + + + +
+
+
+ + + Common Examples + +Let's step through a simple example: a cache of number to name +mappings. The cache keeps a count of how often each of the objects is +used, and when it gets full, throws out the least used one. + + + + + All In User Context + +For our first example, we assume that all operations are in user +context (ie. from system calls), so we can sleep. This means we can +use a semaphore to protect the cache and all the objects within +it. Here's the code: + + + +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/string.h> +#include <asm/semaphore.h> +#include <asm/errno.h> + +struct object +{ + struct list_head list; + int id; + char name[32]; + int popularity; +}; + +/* Protects the cache, cache_num, and the objects within it */ +static DECLARE_MUTEX(cache_lock); +static LIST_HEAD(cache); +static unsigned int cache_num = 0; +#define MAX_CACHE_SIZE 10 + +/* Must be holding cache_lock */ +static struct object *__cache_find(int id) +{ + struct object *i; + + list_for_each_entry(i, &cache, list) + if (i->id == id) { + i->popularity++; + return i; + } + return NULL; +} + +/* Must be holding cache_lock */ +static void __cache_delete(struct object *obj) +{ + BUG_ON(!obj); + list_del(&obj->list); + kfree(obj); + cache_num--; +} + +/* Must be holding cache_lock */ +static void __cache_add(struct object *obj) +{ + list_add(&obj->list, &cache); + if (++cache_num > MAX_CACHE_SIZE) { + struct object *i, *outcast = NULL; + list_for_each_entry(i, &cache, list) { + if (!outcast || i->popularity < outcast->popularity) + outcast = i; + } + __cache_delete(outcast); + } +} + +int cache_add(int id, const char *name) +{ + struct object *obj; + + if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) + return -ENOMEM; + + strlcpy(obj->name, name, sizeof(obj->name)); + obj->id = id; + obj->popularity = 0; + + down(&cache_lock); + __cache_add(obj); + up(&cache_lock); + return 0; +} + +void cache_delete(int id) +{ + down(&cache_lock); + 
__cache_delete(__cache_find(id)); + up(&cache_lock); +} + +int cache_find(int id, char *name) +{ + struct object *obj; + int ret = -ENOENT; + + down(&cache_lock); + obj = __cache_find(id); + if (obj) { + ret = 0; + strcpy(name, obj->name); + } + up(&cache_lock); + return ret; +} + + + +Note that we always make sure we have the cache_lock when we add, +delete, or look up the cache: both the cache infrastructure itself and +the contents of the objects are protected by the lock. In this case +it's easy, since we copy the data for the user, and never let them +access the objects directly. + + +There is a slight (and common) optimization here: in +cache_add we set up the fields of the object +before grabbing the lock. This is safe, as no-one else can access it +until we put it in the cache. + + + + + Accessing From Interrupt Context + +Now consider the case where cache_find can be +called from interrupt context: either a hardware interrupt or a +softirq. An example would be a timer which deletes objects from the +cache. + + +The change is shown below, in standard patch format: the +- are lines which are taken away, and the ++ are lines which are added.
+ + +--- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100 ++++ cache.c.interrupt 2003-12-09 14:07:49.000000000 +1100 +@@ -12,7 +12,7 @@ + int popularity; + }; + +-static DECLARE_MUTEX(cache_lock); ++static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED; + static LIST_HEAD(cache); + static unsigned int cache_num = 0; + #define MAX_CACHE_SIZE 10 +@@ -55,6 +55,7 @@ + int cache_add(int id, const char *name) + { + struct object *obj; ++ unsigned long flags; + + if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) + return -ENOMEM; +@@ -63,30 +64,33 @@ + obj->id = id; + obj->popularity = 0; + +- down(&cache_lock); ++ spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); +- up(&cache_lock); ++ spin_unlock_irqrestore(&cache_lock, flags); + return 0; + } + + void cache_delete(int id) + { +- down(&cache_lock); ++ unsigned long flags; ++ ++ spin_lock_irqsave(&cache_lock, flags); + __cache_delete(__cache_find(id)); +- up(&cache_lock); ++ spin_unlock_irqrestore(&cache_lock, flags); + } + + int cache_find(int id, char *name) + { + struct object *obj; + int ret = -ENOENT; ++ unsigned long flags; + +- down(&cache_lock); ++ spin_lock_irqsave(&cache_lock, flags); + obj = __cache_find(id); + if (obj) { + ret = 0; + strcpy(name, obj->name); + } +- up(&cache_lock); ++ spin_unlock_irqrestore(&cache_lock, flags); + return ret; + } + + + +Note that the spin_lock_irqsave will turn off +interrupts if they are on, otherwise does nothing (if we are already +in an interrupt handler), hence these functions are safe to call from +any context. + + +Unfortunately, cache_add calls +kmalloc with the GFP_KERNEL +flag, which is only legal in user context. I have assumed that +cache_add is still only called in user context, +otherwise this should become a parameter to +cache_add. 
+ + + + Exposing Objects Outside This File + +If our objects contained more information, it might not be sufficient +to copy the information in and out: other parts of the code might want +to keep pointers to these objects, for example, rather than looking up +the id every time. This produces two problems. + + +The first problem is that we use the cache_lock to +protect objects: we'd need to make this non-static so the rest of the +code can use it. This makes locking trickier, as it is no longer all +in one place. + + +The second problem is the lifetime problem: if another structure keeps +a pointer to an object, it presumably expects that pointer to remain +valid. Unfortunately, this is only guaranteed while you hold the +lock, otherwise someone might call cache_delete +and even worse, add another object, re-using the same address. + + +As there is only one lock, you can't hold it forever: no-one else would +get any work done. + + +The solution to this problem is to use a reference count: everyone who +has a pointer to the object increases it when they first get the +object, and drops the reference count when they're finished with it. +Whoever drops it to zero knows it is unused, and can actually delete it. 
+ + +Here is the code: + + + +--- cache.c.interrupt 2003-12-09 14:25:43.000000000 +1100 ++++ cache.c.refcnt 2003-12-09 14:33:05.000000000 +1100 +@@ -7,6 +7,7 @@ + struct object + { + struct list_head list; ++ unsigned int refcnt; + int id; + char name[32]; + int popularity; +@@ -17,6 +18,35 @@ + static unsigned int cache_num = 0; + #define MAX_CACHE_SIZE 10 + ++static void __object_put(struct object *obj) ++{ ++ if (--obj->refcnt == 0) ++ kfree(obj); ++} ++ ++static void __object_get(struct object *obj) ++{ ++ obj->refcnt++; ++} ++ ++void object_put(struct object *obj) ++{ ++ unsigned long flags; ++ ++ spin_lock_irqsave(&cache_lock, flags); ++ __object_put(obj); ++ spin_unlock_irqrestore(&cache_lock, flags); ++} ++ ++void object_get(struct object *obj) ++{ ++ unsigned long flags; ++ ++ spin_lock_irqsave(&cache_lock, flags); ++ __object_get(obj); ++ spin_unlock_irqrestore(&cache_lock, flags); ++} ++ + /* Must be holding cache_lock */ + static struct object *__cache_find(int id) + { +@@ -35,6 +65,7 @@ + { + BUG_ON(!obj); + list_del(&obj->list); ++ __object_put(obj); + cache_num--; + } + +@@ -63,6 +94,7 @@ + strlcpy(obj->name, name, sizeof(obj->name)); + obj->id = id; + obj->popularity = 0; ++ obj->refcnt = 1; /* The cache holds a reference */ + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); +@@ -79,18 +111,15 @@ + spin_unlock_irqrestore(&cache_lock, flags); + } + +-int cache_find(int id, char *name) ++struct object *cache_find(int id) + { + struct object *obj; +- int ret = -ENOENT; + unsigned long flags; + + spin_lock_irqsave(&cache_lock, flags); + obj = __cache_find(id); +- if (obj) { +- ret = 0; +- strcpy(name, obj->name); +- } ++ if (obj) ++ __object_get(obj); + spin_unlock_irqrestore(&cache_lock, flags); +- return ret; ++ return obj; + } + + + +We encapsulate the reference counting in the standard 'get' and 'put' +functions. 
Now we can return the object itself from +cache_find which has the advantage that the user +can now sleep holding the object (eg. to +copy_to_user the name to userspace). + + +The other point to note is that I said a reference should be held for +every pointer to the object: thus the reference count is 1 when first +inserted into the cache. In some versions the framework does not hold +a reference count, but they are more complicated. + + + + Using Atomic Operations For The Reference Count + +In practice, atomic_t would usually be used for +refcnt. There are a number of atomic +operations defined in + +include/asm/atomic.h: these are +guaranteed to be seen atomically from all CPUs in the system, so no +lock is required. In this case, it is simpler than using spinlocks, +although for anything non-trivial using spinlocks is clearer. +atomic_inc and +atomic_dec_and_test are used instead of the +standard increment and decrement operators, and the lock is no longer +used to protect the reference count itself.
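For readers who want to experiment outside the kernel, the same get/put pattern can be sketched in userspace with C11 atomics. This is an analogy under stated assumptions, not the kernel API: atomic_fetch_add and atomic_fetch_sub from stdatomic.h stand in for atomic_inc and atomic_dec_and_test, malloc and free stand in for kmalloc and kfree, and object_new (plus the int return value of object_put) is invented here for illustration.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct object {
	atomic_int refcnt;
	int id;
};

/* Hypothetical helper: allocate an object holding one reference. */
struct object *object_new(int id)
{
	struct object *obj = malloc(sizeof(*obj));

	if (!obj)
		return NULL;
	atomic_init(&obj->refcnt, 1);	/* the creator holds a reference */
	obj->id = id;
	return obj;
}

void object_get(struct object *obj)
{
	atomic_fetch_add(&obj->refcnt, 1);
}

/*
 * Drop one reference.  atomic_fetch_sub returns the value held
 * before the subtraction, so seeing 1 means we just dropped the
 * last reference and must free the object.
 * Returns 1 if the object was freed, 0 otherwise (the return
 * value is an addition for this sketch only).
 */
int object_put(struct object *obj)
{
	if (atomic_fetch_sub(&obj->refcnt, 1) == 1) {
		free(obj);
		return 1;
	}
	return 0;
}
```

As in the kernel version, no lock is needed around the count itself: whichever caller drops the count to zero is the only one that frees the object.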
+ + + +--- cache.c.refcnt 2003-12-09 15:00:35.000000000 +1100 ++++ cache.c.refcnt-atomic 2003-12-11 15:49:42.000000000 +1100 +@@ -7,7 +7,7 @@ + struct object + { + struct list_head list; +- unsigned int refcnt; ++ atomic_t refcnt; + int id; + char name[32]; + int popularity; +@@ -18,33 +18,15 @@ + static unsigned int cache_num = 0; + #define MAX_CACHE_SIZE 10 + +-static void __object_put(struct object *obj) +-{ +- if (--obj->refcnt == 0) +- kfree(obj); +-} +- +-static void __object_get(struct object *obj) +-{ +- obj->refcnt++; +-} +- + void object_put(struct object *obj) + { +- unsigned long flags; +- +- spin_lock_irqsave(&cache_lock, flags); +- __object_put(obj); +- spin_unlock_irqrestore(&cache_lock, flags); ++ if (atomic_dec_and_test(&obj->refcnt)) ++ kfree(obj); + } + + void object_get(struct object *obj) + { +- unsigned long flags; +- +- spin_lock_irqsave(&cache_lock, flags); +- __object_get(obj); +- spin_unlock_irqrestore(&cache_lock, flags); ++ atomic_inc(&obj->refcnt); + } + + /* Must be holding cache_lock */ +@@ -65,7 +47,7 @@ + { + BUG_ON(!obj); + list_del(&obj->list); +- __object_put(obj); ++ object_put(obj); + cache_num--; + } + +@@ -94,7 +76,7 @@ + strlcpy(obj->name, name, sizeof(obj->name)); + obj->id = id; + obj->popularity = 0; +- obj->refcnt = 1; /* The cache holds a reference */ ++ atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); +@@ -119,7 +101,7 @@ + spin_lock_irqsave(&cache_lock, flags); + obj = __cache_find(id); + if (obj) +- __object_get(obj); ++ object_get(obj); + spin_unlock_irqrestore(&cache_lock, flags); + return obj; + } + + + + + + Protecting The Objects Themselves + +In these examples, we assumed that the objects (except the reference +counts) never changed once they are created. 
If we wanted to allow +the name to change, there are three possibilities: + + + + +You can make cache_lock non-static, and tell people +to grab that lock before changing the name in any object. + + + + +You can provide a cache_obj_rename which grabs +this lock and changes the name for the caller, and tell everyone to +use that function. + + + + +You can make the cache_lock protect only the cache +itself, and use another lock to protect the name. + + + + + +Theoretically, you can make the locks as fine-grained as one lock for +every field, for every object. In practice, the most common variants +are: + + + + +One lock which protects the infrastructure (the cache +list in this example) and all the objects. This is what we have done +so far. + + + + +One lock which protects the infrastructure (including the list +pointers inside the objects), and one lock inside the object which +protects the rest of that object. + + + + +Multiple locks to protect the infrastructure (eg. one lock per hash +chain), possibly with a separate per-object lock. + + + + + +Here is the "lock-per-object" implementation: + + +--- cache.c.refcnt-atomic 2003-12-11 15:50:54.000000000 +1100 ++++ cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 +@@ -6,11 +6,17 @@ + + struct object + { ++ /* These two protected by cache_lock. */ + struct list_head list; ++ int popularity; ++ + atomic_t refcnt; ++ ++ /* Doesn't change once created. 
*/ + int id; ++ ++ spinlock_t lock; /* Protects the name */ + char name[32]; +- int popularity; + }; + + static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED; +@@ -77,6 +84,7 @@ + obj->id = id; + obj->popularity = 0; + atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ ++ spin_lock_init(&obj->lock); + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); + + + +Note that I decided that the popularity +count should be protected by the cache_lock rather +than the per-object lock: this is because it (like the +struct list_head inside the object) is +logically part of the infrastructure. This way, I don't need to grab +the lock of every object in __cache_add when +seeking the least popular. + + + +I also decided that the id member is +unchangeable, so I don't need to grab each object lock in +__cache_find() to examine the +id: the object lock is only used by a +caller who wants to read or write the name +field. + + + +Note also that I added a comment describing what data was protected by +which locks. This is extremely important, as it describes the runtime +behavior of the code, and can be hard to gain from just reading. And +as Alan Cox says, "Lock data, not code". + + + + + + Common Problems + + Deadlock: Simple and Advanced + + + There is a coding bug where a piece of code tries to grab a + spinlock twice: it will spin forever, waiting for the lock to + be released (spinlocks, rwlocks and semaphores are not + recursive in Linux). This is trivial to diagnose: not a + stay-up-five-nights-talk-to-fluffy-code-bunnies kind of + problem. + + + + For a slightly more complex case, imagine you have a region + shared by a softirq and user context. If you use a + spin_lock() call to protect it, it is + possible that the user context will be interrupted by the softirq + while it holds the lock, and the softirq will then spin + forever trying to get the same lock.
+ + + + Both of these are called deadlock, and as shown above, it can + occur even with a single CPU (although not on UP compiles, + since spinlocks vanish on kernel compiles with + CONFIG_SMP=n. You'll still get data corruption + in the second example). + + + + This complete lockup is easy to diagnose: on SMP boxes the + watchdog timer or compiling with DEBUG_SPINLOCKS set + (include/linux/spinlock.h) will show this up + immediately when it happens. + + + + A more complex problem is the so-called 'deadly embrace', + involving two or more locks. Say you have a hash table: each + entry in the table is a spinlock, and a chain of hashed + objects. Inside a softirq handler, you sometimes want to + alter an object from one place in the hash to another: you + grab the spinlock of the old hash chain and the spinlock of + the new hash chain, and delete the object from the old one, + and insert it in the new one. + + + + There are two problems here. First, if your code ever + tries to move the object to the same chain, it will deadlock + with itself as it tries to lock it twice. Secondly, if the + same softirq on another CPU is trying to move another object + in the reverse direction, the following could happen: + + + + Consequences + + + + + + CPU 1 + CPU 2 + + + + + + Grab lock A -> OK + Grab lock B -> OK + + + Grab lock B -> spin + Grab lock A -> spin + + + +
+ + + The two CPUs will spin forever, waiting for the other to give up + their lock. It will look, smell, and feel like a crash. + +
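Both problems described above (locking the same chain twice, and two CPUs taking the two locks in opposite order) are commonly avoided by always taking the locks in one fixed order and taking the lock only once when both chains are the same. What follows is a minimal userspace sketch of that idea, assuming POSIX threads rather than kernel spinlocks. The names struct chain, lock_two(), unlock_two(), move_obj() and run_move_demo() are hypothetical and appear nowhere in the kernel.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical hash chain: a lock plus a counter standing in for the list. */
struct chain {
	pthread_mutex_t lock;
	int nobjs;
};

/* Take both locks in address order; take it only once if a == b. */
static void lock_two(struct chain *a, struct chain *b)
{
	if (a == b) {
		pthread_mutex_lock(&a->lock);
	} else if ((uintptr_t)a < (uintptr_t)b) {
		pthread_mutex_lock(&a->lock);
		pthread_mutex_lock(&b->lock);
	} else {
		pthread_mutex_lock(&b->lock);
		pthread_mutex_lock(&a->lock);
	}
}

static void unlock_two(struct chain *a, struct chain *b)
{
	pthread_mutex_unlock(&a->lock);
	if (a != b)
		pthread_mutex_unlock(&b->lock);
}

/* Move one object between chains; a move to the same chain is a no-op. */
static void move_obj(struct chain *from, struct chain *to)
{
	lock_two(from, to);
	if (from != to && from->nobjs > 0) {
		from->nobjs--;
		to->nobjs++;
	}
	unlock_two(from, to);
}

static struct chain chain_a = { PTHREAD_MUTEX_INITIALIZER, 1000 };
static struct chain chain_b = { PTHREAD_MUTEX_INITIALIZER, 1000 };

static void *mover_ab(void *unused)
{
	(void)unused;
	for (int i = 0; i < 1000; i++)
		move_obj(&chain_a, &chain_b);
	return NULL;
}

static void *mover_ba(void *unused)
{
	(void)unused;
	for (int i = 0; i < 1000; i++)
		move_obj(&chain_b, &chain_a);
	return NULL;
}

/* Run both movers in opposite directions; return the total object count. */
static int run_move_demo(void)
{
	pthread_t t1, t2;

	move_obj(&chain_a, &chain_a);	/* same chain: must not self-deadlock */
	if (pthread_create(&t1, NULL, mover_ab, NULL))
		return -1;
	if (pthread_create(&t2, NULL, mover_ba, NULL)) {
		pthread_join(t1, NULL);
		return -1;
	}
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return chain_a.nobjs + chain_b.nobjs;
}
```

Ordering by address is only one possible convention; ordering by hash bucket index would do equally well, as long as every path that takes two chain locks follows the same rule.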
+ + + Preventing Deadlock + + + Textbooks will tell you that if you always lock in the same + order, you will never get this kind of deadlock. Practice + will tell you that this approach doesn't scale: when I + create a new lock, I don't understand enough of the kernel + to figure out where in the 5000 lock hierarchy it will fit. + + + + The best locks are encapsulated: they never get exposed in + headers, and are never held around calls to non-trivial + functions outside the same file. You can read through this + code and see that it will never deadlock, because it never + tries to grab another lock while it has that one. People + using your code don't even need to know you are using a + lock. + + + + A classic problem here is when you provide callbacks or + hooks: if you call these with the lock held, you risk simple + deadlock, or a deadly embrace (who knows what the callback + will do?). Remember, the other programmers are out to get + you, so don't do this. + + + + Overzealous Prevention Of Deadlocks + + + Deadlocks are problematic, but not as bad as data + corruption. Code which grabs a read lock, searches a list, + fails to find what it wants, drops the read lock, grabs a + write lock and inserts the object has a race condition. + + + + If you don't see why, please stay the fuck away from my code. + + + + + + Racing Timers: A Kernel Pastime + + + Timers can produce their own special problems with races. + Consider a collection of objects (list, hash, etc) where each + object has a timer which is due to destroy it. 
+ + + + If you want to destroy the entire collection (say on module + removal), you might do the following: + + + + /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE + HUNGARIAN NOTATION */ + spin_lock_bh(&list_lock); + + while (list) { + struct foo *next = list->next; + del_timer(&list->timer); + kfree(list); + list = next; + } + + spin_unlock_bh(&list_lock); + + + + Sooner or later, this will crash on SMP, because a timer can + have just gone off before the spin_lock_bh(), + and it will only get the lock after we + spin_unlock_bh(), and then try to free + the element (which has already been freed!). + + + + This can be avoided by checking the result of + del_timer(): if it returns + 1, the timer has been deleted. + If 0, it means (in this + case) that it is currently running, so we can do: + + + + retry: + spin_lock_bh(&list_lock); + + while (list) { + struct foo *next = list->next; + if (!del_timer(&list->timer)) { + /* Give timer a chance to delete this */ + spin_unlock_bh(&list_lock); + goto retry; + } + kfree(list); + list = next; + } + + spin_unlock_bh(&list_lock); + + + + Another common problem is deleting timers which restart + themselves (by calling add_timer() at the end + of their timer function). Because this is a fairly common case + which is prone to races, you should use del_timer_sync() + (include/linux/timer.h) + to handle this case. It returns the number of times the timer + had to be deleted before we finally stopped it from adding itself back + in. + + + +
+ + + Locking Speed + + +There are three main things to worry about when considering speed of +some code which does locking. First is concurrency: how many things +are going to be waiting while someone else is holding a lock. Second +is the time taken to actually acquire and release an uncontended lock. +Third is using fewer, or smarter locks. I'm assuming that the lock is +used fairly often: otherwise, you wouldn't be concerned about +efficiency. + + +Concurrency depends on how long the lock is usually held: you should +hold the lock for as long as needed, but no longer. In the cache +example, we always create the object without the lock held, and then +grab the lock only when we are ready to insert it in the list. + + +Acquisition times depend on how much damage the lock operations do to +the pipeline (pipeline stalls) and how likely it is that this CPU was +the last one to grab the lock (ie. is the lock cache-hot for this +CPU): on a machine with more CPUs, this likelihood drops fast. +Consider a 700MHz Intel Pentium III: an instruction takes about 0.7ns, +an atomic increment takes about 58ns, a lock which is cache-hot on +this CPU takes 160ns, and a cacheline transfer from another CPU takes +an additional 170 to 360ns. (These figures from Paul McKenney's + Linux +Journal RCU article). + + +These two aims conflict: holding a lock for a short time might be done +by splitting locks into parts (such as in our final per-object-lock +example), but this increases the number of lock acquisitions, and the +results are often slower than having a single lock. This is another +reason to advocate locking simplicity. + + +The third concern is addressed below: there are some methods to reduce +the amount of locking which needs to be done. + + + + Read/Write Lock Variants + + + Both spinlocks and semaphores have read/write variants: + rwlock_t and struct rw_semaphore. + These divide users into two classes: the readers and the writers. 
If + you are only reading the data, you can get a read lock, but to write to + the data you need the write lock. Many people can hold a read lock, + but a writer must be sole holder. + + + + If your code divides neatly along reader/writer lines (as our + cache code does), and the lock is held by readers for + significant lengths of time, using these locks can help. They + are slightly slower than the normal locks though, so in practice + rwlock_t is not usually worthwhile. + + + + + Avoiding Locks: Read Copy Update + + + There is a special method of read/write locking called Read Copy + Update. Using RCU, the readers can avoid taking a lock + altogether: as we expect our cache to be read more often than + updated (otherwise the cache is a waste of time), it is a + candidate for this optimization. + + + + How do we get rid of read locks? Getting rid of read locks + means that writers may be changing the list underneath the + readers. That is actually quite simple: we can read a linked + list while an element is being added if the writer adds the + element very carefully. For example, adding + new to a single linked list called + list: + + + + new->next = list->next; + wmb(); + list->next = new; + + + + The wmb() is a write memory barrier. It + ensures that the first operation (setting the new element's + next pointer) is complete and will be seen by + all CPUs, before the second operation is (putting the new + element into the list). This is important, since modern + compilers and modern CPUs can both reorder instructions unless + told otherwise: we want a reader to either not see the new + element at all, or see the new element with the + next pointer correctly pointing at the rest of + the list. + + + Fortunately, there is a function to do this for standard + struct list_head lists: + list_add_rcu() + (include/linux/list.h). 
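The ordering requirement described above can also be illustrated outside the kernel. The following is a minimal userspace sketch, assuming C11 atomics, where a release store stands in for the wmb() before the pointer update. It is not the implementation of list_add_rcu(); struct node, publish_after(), sum_list() and demo_publish() are hypothetical names chosen for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical singly linked node; the next pointer is read locklessly. */
struct node {
	int value;
	struct node *_Atomic next;
};

/* Writer side: insert new_node after head. Writers must exclude each other. */
static void publish_after(struct node *head, struct node *new_node)
{
	struct node *rest;

	rest = atomic_load_explicit(&head->next, memory_order_relaxed);
	/* Step 1: make the new element point at the rest of the list. */
	atomic_store_explicit(&new_node->next, rest, memory_order_relaxed);
	/*
	 * Step 2: publish it. The release ordering guarantees that a reader
	 * which sees new_node also sees its value and next pointer.
	 */
	atomic_store_explicit(&head->next, new_node, memory_order_release);
}

/* Reader side: walk the list without taking any lock. */
static int sum_list(struct node *head)
{
	int sum = 0;
	struct node *n;

	n = atomic_load_explicit(&head->next, memory_order_acquire);
	while (n) {
		sum += n->value;
		n = atomic_load_explicit(&n->next, memory_order_acquire);
	}
	return sum;
}

/* Single-threaded walk-through: publish two nodes, then sum the values. */
static int demo_publish(void)
{
	static struct node head = { 0, NULL };
	static struct node a = { 1, NULL };
	static struct node b = { 2, NULL };

	publish_after(&head, &a);
	publish_after(&head, &b);
	return sum_list(&head);
}
```

A reader running concurrently with publish_after() either misses the new node entirely or sees it fully initialized, which is the property the text asks for.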
+ + + Removing an element from the list is even simpler: we replace + the pointer to the old element with a pointer to its successor, + and readers will either see it, or skip over it. + + + list->next = old->next; + + + There is list_del_rcu() + (include/linux/list.h) which does this (the + normal version poisons the old object, which we don't want). + + + The reader must also be careful: some CPUs can look through the + next pointer to start reading the contents of + the next element early, but don't realize that the pre-fetched + contents are wrong when the next pointer changes + underneath them. Once again, there is a + list_for_each_entry_rcu() + (include/linux/list.h) to help you. Of + course, writers can just use + list_for_each_entry(), since there cannot + be two simultaneous writers. + + + Our final dilemma is this: when can we actually destroy the + removed element? Remember, a reader might be stepping through + this element in the list right now: if we free this element and + the next pointer changes, the reader will jump + off into garbage and crash. We need to wait until we know that + all the readers who were traversing the list when we deleted the + element are finished. We use call_rcu() to + register a callback which will actually destroy the object once + the readers are finished. + + + But how does Read Copy Update know when the readers are + finished? The method is this: firstly, the readers always + traverse the list inside + rcu_read_lock()/rcu_read_unlock() + pairs: these simply disable preemption so the reader won't go to + sleep while reading the list. + + + RCU then waits until every other CPU has slept at least once: + since readers cannot sleep, we know that any readers which were + traversing the list during the deletion are finished, and the + callback is triggered. The real Read Copy Update code is a + little more optimized than this, but this is the fundamental + idea.
+ + + +--- cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 ++++ cache.c.rcupdate 2003-12-11 17:55:14.000000000 +1100 +@@ -1,15 +1,18 @@ + #include <linux/list.h> + #include <linux/slab.h> + #include <linux/string.h> ++#include <linux/rcupdate.h> + #include <asm/semaphore.h> + #include <asm/errno.h> + + struct object + { +- /* These two protected by cache_lock. */ ++ /* This is protected by RCU */ + struct list_head list; + int popularity; + ++ struct rcu_head rcu; ++ + atomic_t refcnt; + + /* Doesn't change once created. */ +@@ -40,7 +43,7 @@ + { + struct object *i; + +- list_for_each_entry(i, &cache, list) { ++ list_for_each_entry_rcu(i, &cache, list) { + if (i->id == id) { + i->popularity++; + return i; +@@ -49,19 +52,25 @@ + return NULL; + } + ++/* Final discard done once we know no readers are looking. */ ++static void cache_delete_rcu(void *arg) ++{ ++ object_put(arg); ++} ++ + /* Must be holding cache_lock */ + static void __cache_delete(struct object *obj) + { + BUG_ON(!obj); +- list_del(&obj->list); +- object_put(obj); ++ list_del_rcu(&obj->list); + cache_num--; ++ call_rcu(&obj->rcu, cache_delete_rcu, obj); + } + + /* Must be holding cache_lock */ + static void __cache_add(struct object *obj) + { +- list_add(&obj->list, &cache); ++ list_add_rcu(&obj->list, &cache); + if (++cache_num > MAX_CACHE_SIZE) { + struct object *i, *outcast = NULL; + list_for_each_entry(i, &cache, list) { +@@ -85,6 +94,7 @@ + obj->popularity = 0; + atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ + spin_lock_init(&obj->lock); ++ INIT_RCU_HEAD(&obj->rcu); + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); +@@ -104,12 +114,11 @@ + struct object *cache_find(int id) + { + struct object *obj; +- unsigned long flags; + +- spin_lock_irqsave(&cache_lock, flags); ++ rcu_read_lock(); + obj = __cache_find(id); + if (obj) + object_get(obj); +- spin_unlock_irqrestore(&cache_lock, flags); ++ rcu_read_unlock(); + return obj; + } + + + +Note that the reader will 
alter the +popularity member in +__cache_find(), and now it doesn't hold a lock. +One solution would be to make it an atomic_t, but for +this usage, we don't really care about races: an approximate result is +good enough, so I didn't change it. + + + +The result is that cache_find() requires no +synchronization with any other functions, so is almost as fast on SMP +as it would be on UP. + + + +There is a further optimization possible here: remember our original +cache code, where there were no reference counts and the caller simply +held the lock whenever using the object? This is still possible: if +you hold the lock, no one can delete the object, so you don't need to +get and put the reference count. + + + +Now, because the 'read lock' in RCU is simply disabling preemption, a +caller which always has preemption disabled between calling +cache_find() and +object_put() does not need to actually get and +put the reference count: we could expose +__cache_find() by making it non-static, and +such callers could simply call that. + + +The benefit here is that the reference count is not written to: the +object is not altered in any way, which is much faster on SMP +machines due to caching. + + + + + Per-CPU Data + + + Another technique for avoiding locking which is used fairly + widely is to duplicate information for each CPU. For example, + if you wanted to keep a count of a common condition, you could + use a spin lock and a single counter. Nice and simple. + + + + If that was too slow (it's usually not, but if you've got a + really big machine to test on and can show that it is), you + could instead use a counter for each CPU, then none of them need + an exclusive lock. See DEFINE_PER_CPU(), + get_cpu_var() and + put_cpu_var() + (include/linux/percpu.h). + + + + Of particular use for simple per-cpu counters is the + local_t type, and the + cpu_local_inc() and related functions, + which are more efficient than simple code on some architectures + (include/asm/local.h).
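The idea of giving each CPU its own counter can be sketched in userspace as well. The example below is only an analogy, assuming POSIX threads in place of CPUs: it does not use DEFINE_PER_CPU() or local_t, and struct slot, worker() and count_events() are hypothetical names. Each thread increments its own cacheline-aligned slot without any lock, and the total is computed only after every writer has finished.

```c
#include <pthread.h>

#define NTHREADS	4
#define NEVENTS		100000
#define CACHELINE	64	/* assumed cacheline size */

/* One slot per thread; the alignment keeps slots on separate cachelines. */
struct slot {
	_Alignas(CACHELINE) unsigned long count;
};

static struct slot slots[NTHREADS];

static void *worker(void *arg)
{
	struct slot *mine = arg;

	for (int i = 0; i < NEVENTS; i++)
		mine->count++;	/* no lock: only this thread writes this slot */
	return NULL;
}

/* Start the workers, wait for them, then add up the per-thread counts. */
static unsigned long count_events(void)
{
	pthread_t tid[NTHREADS];
	unsigned long total = 0;
	int started = 0;

	for (int i = 0; i < NTHREADS; i++) {
		slots[i].count = 0;
		if (pthread_create(&tid[i], NULL, worker, &slots[i]))
			break;
		started++;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tid[i], NULL);
	/* The sum is exact only because every writer has already stopped. */
	for (int i = 0; i < started; i++)
		total += slots[i].count;
	return total;
}
```

Summing while the workers are still running would give only an approximate value, since the slots keep changing underneath the reader.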
+ + + + Note that there is no simple, reliable way of getting an exact + value of such a counter, without introducing more locks. This + is not a problem for some uses. + + + + + Data Which Mostly Used By An IRQ Handler + + + If data is always accessed from within the same IRQ handler, you + don't need a lock at all: the kernel already guarantees that the + irq handler will not run simultaneously on multiple CPUs. + + + Manfred Spraul points out that you can still do this, even if + the data is very occasionally accessed in user context or + softirqs/tasklets. The irq handler doesn't use a lock, and + all other accesses are done as so: + + + + spin_lock(&lock); + disable_irq(irq); + ... + enable_irq(irq); + spin_unlock(&lock); + + + The disable_irq() prevents the irq handler + from running (and waits for it to finish if it's currently + running on other CPUs). The spinlock prevents any other + accesses happening at the same time. Naturally, this is slower + than just a spin_lock_irq() call, so it + only makes sense if this type of access happens extremely + rarely. + + + + + + What Functions Are Safe To Call From Interrupts? + + + Many functions in the kernel sleep (ie. call schedule()) + directly or indirectly: you can never call them while holding a + spinlock, or with preemption disabled. This also means you need + to be in user context: calling them from an interrupt is illegal. + + + + Some Functions Which Sleep + + + The most common ones are listed below, but you usually have to + read the code to find out if other calls are safe. If everyone + else who calls it can sleep, you probably need to be able to + sleep, too. In particular, registration and deregistration + functions usually expect to be called from user context, and can + sleep. 
+ + + + + + Accesses to + userspace: + + + + + copy_from_user() + + + + + copy_to_user() + + + + + get_user() + + + + + put_user() + + + + + + + + kmalloc(GFP_KERNEL) + + + + + + down_interruptible() and + down() + + + There is a down_trylock() which can be + used inside interrupt context, as it will not sleep. + up() will also never sleep. + + + + + + + Some Functions Which Don't Sleep + + + Some functions are safe to call from any context, or holding + almost any lock. + + + + + + printk() + + + + + kfree() + + + + + add_timer() and del_timer() + + + + + + + + Further reading + + + + + Documentation/spinlocks.txt: + Linus Torvalds' spinlocking tutorial in the kernel sources. + + + + + + Unix Systems for Modern Architectures: Symmetric + Multiprocessing and Caching for Kernel Programmers: + + + + Curt Schimmel's very good introduction to kernel level + locking (not written for Linux, but nearly everything + applies). The book is expensive, but really worth every + penny to understand SMP locking. [ISBN: 0201633388] + + + + + + + Thanks + + + Thanks to Telsa Gwynne for DocBooking, neatening and adding + style. + + + + Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul + Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim + Waugh, Pete Zaitcev, James Morris, Robert Love, Paul McKenney, + John Ashby for proofreading, correcting, flaming, commenting. + + + + Thanks to the cabal for having no influence on this document. + + + + + Glossary + + + preemption + + + Prior to 2.5, or when CONFIG_PREEMPT is + unset, processes in user context inside the kernel would not + preempt each other (ie. you had that CPU until you gave it up, + except for interrupts). With the addition of + CONFIG_PREEMPT in 2.5.4, this changed: when + in user context, higher priority tasks can "cut in": spinlocks + were changed to disable preemption, even on UP.
+ + + + + + bh + + + Bottom Half: for historical reasons, functions with + '_bh' in them often now refer to any software interrupt, e.g. + spin_lock_bh() blocks any software interrupt + on the current CPU. Bottom halves are deprecated, and will + eventually be replaced by tasklets. Only one bottom half will be + running at any time. + + + + + + Hardware Interrupt / Hardware IRQ + + + Hardware interrupt request. in_irq() returns + true in a hardware interrupt handler. + + + + + + Interrupt Context + + + Not user context: processing a hardware irq or software irq. + Indicated by the in_interrupt() macro + returning true. + + + + + + SMP + + + Symmetric Multi-Processor: kernels compiled for multiple-CPU + machines. (CONFIG_SMP=y). + + + + + + Software Interrupt / softirq + + + Software interrupt handler. in_irq() returns + false; in_softirq() + returns true. Tasklets and softirqs + both fall into the category of 'software interrupts'. + + + Strictly speaking a softirq is one of up to 32 enumerated software + interrupts which can run on multiple CPUs at once. + Sometimes used to refer to tasklets as + well (ie. all software interrupts). + + + + + + tasklet + + + A dynamically-registrable software interrupt, + which is guaranteed to only run on one CPU at a time. + + + + + + timer + + + A dynamically-registrable software interrupt, which is run at + (or close to) a given time. When running, it is just like a + tasklet (in fact, they are called from the TIMER_SOFTIRQ). + + + + + + UP + + + Uni-Processor: Non-SMP. (CONFIG_SMP=n). + + + + + + User Context + + + The kernel executing on behalf of a particular process (ie. a + system call or trap) or kernel thread. You can tell which + process with the current macro. Not to + be confused with userspace. Can be interrupted by software or + hardware interrupts. + + + + + + Userspace + + + A process executing its own code outside the kernel. + + + + + +
+ diff --git a/Documentation/DocBook/libata.tmpl b/Documentation/DocBook/libata.tmpl new file mode 100644 index 000000000000..cf2fce7707da --- /dev/null +++ b/Documentation/DocBook/libata.tmpl @@ -0,0 +1,282 @@ + + + + + + libATA Developer's Guide + + + + Jeff + Garzik + + + + + 2003 + Jeff Garzik + + + + + The contents of this file are subject to the Open + Software License version 1.1 that can be found at + http://www.opensource.org/licenses/osl-1.1.txt and is included herein + by reference. + + + + Alternatively, the contents of this file may be used under the terms + of the GNU General Public License version 2 (the "GPL") as distributed + in the kernel source COPYING file, in which case the provisions of + the GPL are applicable instead of the above. If you wish to allow + the use of your version of this file only under the terms of the + GPL and not to allow others to use your version of this file under + the OSL, indicate your decision by deleting the provisions above and + replace them with the notice and other provisions required by the GPL. + If you do not delete the provisions above, a recipient may use your + version of this file under either the OSL or the GPL. + + + + + + + + + Thanks + + The bulk of the ATA knowledge comes thanks to long conversations with + Andre Hedrick (www.linux-ide.org). + + + Thanks to Alan Cox for pointing out similarities + between SATA and SCSI, and in general for motivation to hack on + libata. + + + libata's device detection + method, ata_pio_devchk, and in general all the early probing was + based on extensive study of Hale Landis's probe/reset code in his + ATADRVR driver (www.ata-atapi.com). + + + + + libata Driver API + + struct ata_port_operations + + +void (*port_disable) (struct ata_port *); + + + + Called from ata_bus_probe() and ata_bus_reset() error paths, + as well as when unregistering from the SCSI module (rmmod, hot + unplug). 
+ + + +void (*dev_config) (struct ata_port *, struct ata_device *); + + + + Called after IDENTIFY [PACKET] DEVICE is issued to each device + found. Typically used to apply device-specific fixups prior to + issue of SET FEATURES - XFER MODE, and prior to operation. + + + +void (*set_piomode) (struct ata_port *, struct ata_device *); +void (*set_dmamode) (struct ata_port *, struct ata_device *); +void (*post_set_mode) (struct ata_port *ap); + + + + Hooks called prior to the issue of SET FEATURES - XFER MODE + command. dev->pio_mode is guaranteed to be valid when + ->set_piomode() is called, and dev->dma_mode is guaranteed to be + valid when ->set_dmamode() is called. ->post_set_mode() is + called unconditionally, after the SET FEATURES - XFER MODE + command completes successfully. + + + + ->set_piomode() is always called (if present), but + ->set_dmamode() is only called if DMA is possible. + + + +void (*tf_load) (struct ata_port *ap, struct ata_taskfile *tf); +void (*tf_read) (struct ata_port *ap, struct ata_taskfile *tf); + + + + ->tf_load() is called to load the given taskfile into hardware + registers / DMA buffers. ->tf_read() is called to read the + hardware registers / DMA buffers, to obtain the current set of + taskfile register values. + + + +void (*exec_command)(struct ata_port *ap, struct ata_taskfile *tf); + + + + Causes an ATA command, previously loaded with + ->tf_load(), to be initiated in hardware. + + + +u8 (*check_status)(struct ata_port *ap); + + + + Reads the Status ATA shadow register from hardware. On some + hardware, this has the side effect of clearing the interrupt + condition. + + + +void (*dev_select)(struct ata_port *ap, unsigned int device); + + + + Issues the low-level hardware command(s) that causes one of N + hardware devices to be considered 'selected' (active and + available for use) on the ATA bus.
+ + + +void (*phy_reset) (struct ata_port *ap); + + + + The very first step in the probe phase. Actions typically vary + depending on the bus type. After waking up the device and probing + for device presence (PATA and SATA), typically a soft reset + (SRST) will be performed. Drivers typically use the helper + functions ata_bus_reset() or sata_phy_reset() for this hook. + + + +void (*bmdma_setup) (struct ata_queued_cmd *qc); +void (*bmdma_start) (struct ata_queued_cmd *qc); + + + + When setting up an IDE BMDMA transaction, these hooks arm + (->bmdma_setup) and fire (->bmdma_start) the hardware's DMA + engine. + + + +void (*qc_prep) (struct ata_queued_cmd *qc); +int (*qc_issue) (struct ata_queued_cmd *qc); + + + + Higher-level hooks, these two hooks can potentially supersede + several of the above taskfile/DMA engine hooks. ->qc_prep is + called after the buffers have been DMA-mapped, and is typically + used to populate the hardware's DMA scatter-gather table. + Most drivers use the standard ata_qc_prep() helper function, but + more advanced drivers roll their own. + + + ->qc_issue is used to make a command active, once the hardware + and S/G tables have been prepared. IDE BMDMA drivers use the + helper function ata_qc_issue_prot() for taskfile protocol-based + dispatch. More advanced drivers roll their own ->qc_issue + implementation, using this as the "issue new ATA command to + hardware" hook. + + + +void (*eng_timeout) (struct ata_port *ap); + + + + This is a high level error handling function, called from the + error handling thread, when a command times out. + + + +irqreturn_t (*irq_handler)(int, void *, struct pt_regs *); +void (*irq_clear) (struct ata_port *); + + + + ->irq_handler is the interrupt handling routine registered with + the system, by libata. ->irq_clear is called during probe just + before the interrupt handler is registered, to be sure hardware + is quiet.
+ + + +u32 (*scr_read) (struct ata_port *ap, unsigned int sc_reg); +void (*scr_write) (struct ata_port *ap, unsigned int sc_reg, + u32 val); + + + + Read and write standard SATA phy registers. Currently only used + if ->phy_reset hook called the sata_phy_reset() helper function. + + + +int (*port_start) (struct ata_port *ap); +void (*port_stop) (struct ata_port *ap); +void (*host_stop) (struct ata_host_set *host_set); + + + + ->port_start() is called just after the data structures for each + port are initialized. Typically this is used to alloc per-port + DMA buffers / tables / rings, enable DMA engines, and similar + tasks. + + + ->host_stop() is called when the rmmod or hot unplug process + begins. The hook must stop all hardware interrupts, DMA + engines, etc. + + + ->port_stop() is called after ->host_stop(). Its sole function + is to release DMA/memory resources, now that they are no longer + actively being used. + + + + + + + libata Library +!Edrivers/scsi/libata-core.c + + + + libata Core Internals +!Idrivers/scsi/libata-core.c + + + + libata SCSI translation/emulation +!Edrivers/scsi/libata-scsi.c +!Idrivers/scsi/libata-scsi.c + + + + ata_piix Internals +!Idrivers/scsi/ata_piix.c + + + + sata_sil Internals +!Idrivers/scsi/sata_sil.c + + + diff --git a/Documentation/DocBook/librs.tmpl b/Documentation/DocBook/librs.tmpl new file mode 100644 index 000000000000..3ff39bafc00e --- /dev/null +++ b/Documentation/DocBook/librs.tmpl @@ -0,0 +1,289 @@ + + + + + + Reed-Solomon Library Programming Interface + + + + Thomas + Gleixner + +
+ tglx@linutronix.de +
+
+
+
+ + + 2004 + Thomas Gleixner + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License version 2 as published by the Free Software Foundation. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + +
+ + + + + Introduction + + The generic Reed-Solomon Library provides encoding, decoding + and error correction functions. + + + Reed-Solomon codes are used in communication and storage + applications to ensure data integrity. + + + This documentation is provided for developers who want to utilize + the functions provided by the library. + + + + + Known Bugs And Assumptions + + None. + + + + + Usage + + This chapter provides examples of how to use the library. + + + Initializing + + The init function init_rs returns a pointer to a + rs decoder structure, which holds the necessary + information for encoding, decoding and error correction + with the given polynomial. It either uses an existing + matching decoder or creates a new one. On creation all + the lookup tables for fast en/decoding are created. + The function may take a while, so make sure not to + call it in critical code paths. + + +/* the Reed Solomon control structure */ +static struct rs_control *rs_decoder; + +/* Symbolsize is 10 (bits) + * Primitive polynomial is x^10+x^3+1 + * first consecutive root is 0 + * primitive element to generate roots = 1 + * generator polynomial degree (number of roots) = 6 + */ +rs_decoder = init_rs (10, 0x409, 0, 1, 6); + + + + Encoding + + The encoder calculates the Reed-Solomon code over + the given data length and stores the result in + the parity buffer. Note that the parity buffer must + be initialized before calling the encoder. + + + The expanded data can be inverted on the fly by + providing a non zero inversion mask. The expanded data is + XOR'ed with the mask. This is used e.g. for FLASH + ECC, where the all 0xFF is inverted to an all 0x00. + The Reed-Solomon code for all 0x00 is all 0x00. The + code is inverted before storing to FLASH so it is 0xFF + too. This prevents reading from an erased FLASH + from resulting in ECC errors. + + + The databytes are expanded to the given symbol size + on the fly.
There is no support for encoding continuous + bitstreams with a symbol size != 8 at the moment. If + it is necessary it should not be a big deal to implement + such functionality. + + +/* Parity buffer. Size = number of roots */ +uint16_t par[6]; +/* Initialize the parity buffer */ +memset(par, 0, sizeof(par)); +/* Encode 512 bytes in data8. Store parity in buffer par */ +encode_rs8 (rs_decoder, data8, 512, par, 0); + + + + Decoding + + The decoder calculates the syndrome over + the given data length and the received parity symbols + and corrects errors in the data. + + + If a syndrome is available from a hardware decoder + then the syndrome calculation is skipped. + + + The correction of the data buffer can be suppressed + by providing a correction pattern buffer and an error + location buffer to the decoder. The decoder stores the + calculated error location and the correction bitmask + in the given buffers. This is useful for hardware + decoders which use a weird bit ordering scheme. + + + The databytes are expanded to the given symbol size + on the fly. There is no support for decoding continuous + bitstreams with a symbol size != 8 at the moment. If + it is necessary it should not be a big deal to implement + such functionality. + + + + + Decoding with syndrome calculation, direct data correction + + +/* Parity buffer. Size = number of roots */ +uint16_t par[6]; +uint8_t data8[512]; +int numerr; +/* Receive data */ +..... +/* Receive parity */ +..... +/* Decode 512 bytes in data8. */ +numerr = decode_rs8 (rs_decoder, data8, par, 512, NULL, 0, NULL, 0, NULL); + + + + + + Decoding with syndrome given by hardware decoder, direct data correction + + +/* Parity buffer. Size = number of roots */ +uint16_t par[6], syn[6]; +uint8_t data8[512]; +int numerr; +/* Receive data */ +..... +/* Receive parity */ +..... +/* Get syndrome from hardware decoder */ +.....
+/* Decode 512 bytes in data8. */ +numerr = decode_rs8 (rs_decoder, data8, par, 512, syn, 0, NULL, 0, NULL); + + + + + + Decoding with syndrome given by hardware decoder, no direct data correction. + + + Note: It's not necessary to give data and received parity to the decoder. + + +/* Parity buffer. Size = number of roots */ +uint16_t par[6], syn[6], corr[8]; +uint8_t data8[512]; +int i, numerr, errpos[8]; +/* Receive data */ +..... +/* Receive parity */ +..... +/* Get syndrome from hardware decoder */ +..... +/* Compute error positions and correction patterns; data8 is not touched */ +numerr = decode_rs8 (rs_decoder, NULL, NULL, 512, syn, 0, errpos, 0, corr); +for (i = 0; i < numerr; i++) { + do_error_correction_in_your_buffer(errpos[i], corr[i]); +} + + + + + Cleanup + + The function free_rs frees the allocated resources, + if the caller is the last user of the decoder. + + +/* Release resources */ +free_rs(rs_decoder); + + + + + + + Structures + + This chapter contains the autogenerated documentation of the structures which are + used in the Reed-Solomon Library and are relevant for a developer. + +!Iinclude/linux/rslib.h + + + + Public Functions Provided + + This chapter contains the autogenerated documentation of the Reed-Solomon functions + which are exported. + +!Elib/reed_solomon/reed_solomon.c + + + + Credits + + The library code for encoding and decoding was written by Phil Karn. + + + Copyright 2002, Phil Karn, KA9Q + May be used under the terms of the GNU General Public License (GPL) + + + The wrapper functions and interfaces are written by Thomas Gleixner + + + Many users have provided bugfixes, improvements and helping hands for testing. + Thanks a lot. + + + The following people have contributed to this document: + + + Thomas Gleixner tglx@linutronix.de + + +
diff --git a/Documentation/DocBook/lsm.tmpl b/Documentation/DocBook/lsm.tmpl new file mode 100644 index 000000000000..f63822195871 --- /dev/null +++ b/Documentation/DocBook/lsm.tmpl @@ -0,0 +1,265 @@ + + + +
+ + Linux Security Modules: General Security Hooks for Linux + + + Stephen + Smalley + + NAI Labs +
ssmalley@nai.com
+
+
+ + Timothy + Fraser + + NAI Labs +
tfraser@nai.com
+
+
+ + Chris + Vance + + NAI Labs +
cvance@nai.com
+
+
+
+
+ +Introduction + + +In March 2001, the National Security Agency (NSA) gave a presentation +about Security-Enhanced Linux (SELinux) at the 2.5 Linux Kernel +Summit. SELinux is an implementation of flexible and fine-grained +nondiscretionary access controls in the Linux kernel, originally +implemented as its own particular kernel patch. Several other +security projects (e.g. RSBAC, Medusa) have also developed flexible +access control architectures for the Linux kernel, and various +projects have developed particular access control models for Linux +(e.g. LIDS, DTE, SubDomain). Each project has developed and +maintained its own kernel patch to support its security needs. + + + +In response to the NSA presentation, Linus Torvalds made a set of +remarks that described a security framework he would be willing to +consider for inclusion in the mainstream Linux kernel. He described a +general framework that would provide a set of security hooks to +control operations on kernel objects and a set of opaque security +fields in kernel data structures for maintaining security attributes. +This framework could then be used by loadable kernel modules to +implement any desired model of security. Linus also suggested the +possibility of migrating the Linux capabilities code into such a +module. + + + +The Linux Security Modules (LSM) project was started by WireX to +develop such a framework. LSM is a joint development effort by +several security projects, including Immunix, SELinux, SGI and Janus, +and several individuals, including Greg Kroah-Hartman and James +Morris, to develop a Linux kernel patch that implements this +framework. The patch is currently tracking the 2.4 series and is +targeted for integration into the 2.5 development series. This +technical report provides an overview of the framework and the example +capabilities security module provided by the LSM kernel patch. 
+ + + + +LSM Framework + + +The LSM kernel patch provides a general kernel framework to support +security modules. In particular, the LSM framework is primarily +focused on supporting access control modules, although future +development is likely to address other security needs such as +auditing. By itself, the framework does not provide any additional +security; it merely provides the infrastructure to support security +modules. The LSM kernel patch also moves most of the capabilities +logic into an optional security module, with the system defaulting +to the traditional superuser logic. This capabilities module +is discussed further in . + + + +The LSM kernel patch adds security fields to kernel data structures +and inserts calls to hook functions at critical points in the kernel +code to manage the security fields and to perform access control. It +also adds functions for registering and unregistering security +modules, and adds a general security system call +to support new system calls for security-aware applications. + + + +The LSM security fields are simply void* pointers. For +process and program execution security information, security fields +were added to struct task_struct and +struct linux_binprm. For filesystem security +information, a security field was added to +struct super_block. For pipe, file, and socket +security information, security fields were added to +struct inode and +struct file. For packet and network device security +information, security fields were added to +struct sk_buff and +struct net_device. For System V IPC security +information, security fields were added to +struct kern_ipc_perm and +struct msg_msg; additionally, the definitions +for struct msg_msg, struct +msg_queue, and struct +shmid_kernel were moved to header files +(include/linux/msg.h and +include/linux/shm.h as appropriate) to allow +the security modules to use these definitions. + + + +Each LSM hook is a function pointer in a global table, +security_ops. 
This table is a +security_operations structure as defined by +include/linux/security.h. Detailed documentation +for each hook is included in this header file. At present, this +structure consists of a collection of substructures that group related +hooks based on the kernel object (e.g. task, inode, file, sk_buff, +etc) as well as some top-level hook function pointers for system +operations. This structure is likely to be flattened in the future +for performance. The placement of the hook calls in the kernel code +is described by the "called:" lines in the per-hook documentation in +the header file. The hook calls can also be easily found in the +kernel code by looking for the string "security_ops->". + + + + +Linus mentioned per-process security hooks in his original remarks as a +possible alternative to global security hooks. However, if LSM were +to start from the perspective of per-process hooks, then the base +framework would have to deal with how to handle operations that +involve multiple processes (e.g. kill), since each process might have +its own hook for controlling the operation. This would require a +general mechanism for composing hooks in the base framework. +Additionally, LSM would still need global hooks for operations that +have no process context (e.g. network input operations). +Consequently, LSM provides global security hooks, but a security +module is free to implement per-process hooks (where that makes sense) +by storing a security_ops table in each process' security field and +then invoking these per-process hooks from the global hooks. +The problem of composition is thus deferred to the module. + + + +The global security_ops table is initialized to a set of hook +functions provided by a dummy security module that provides +traditional superuser logic. 
A register_security +function (in security/security.c) is provided to +allow a security module to set security_ops to refer to its own hook +functions, and an unregister_security function is +provided to revert security_ops to the dummy module hooks. This +mechanism is used to set the primary security module, which is +responsible for making the final decision for each hook. + + + +LSM also provides a simple mechanism for stacking additional security +modules with the primary security module. It defines +register_security and +unregister_security hooks in the +security_operations structure and provides +mod_reg_security and +mod_unreg_security functions that invoke these +hooks after performing some sanity checking. A security module can +call these functions in order to stack with other modules. However, +the actual details of how this stacking is handled are deferred to the +module, which can implement these hooks in any way it wishes +(including always returning an error if it does not wish to support +stacking). In this manner, LSM again defers the problem of +composition to the module. + + + +Although the LSM hooks are organized into substructures based on +kernel object, all of the hooks can be viewed as falling into two +major categories: hooks that are used to manage the security fields +and hooks that are used to perform access control. Examples of the +first category of hooks include the +alloc_security and +free_security hooks defined for each kernel data +structure that has a security field. These hooks are used to allocate +and free security structures for kernel objects. The first category +of hooks also includes hooks that set information in the security +field after allocation, such as the post_lookup +hook in struct inode_security_ops. This hook +is used to set security information for inodes after successful lookup +operations. An example of the second category of hooks is the +permission hook in +struct inode_security_ops. 
This hook checks +permission when accessing an inode. + + + + +LSM Capabilities Module + + +The LSM kernel patch moves most of the existing POSIX.1e capabilities +logic into an optional security module stored in the file +security/capability.c. This change allows +users who do not want to use capabilities to omit this code entirely +from their kernel, instead using the dummy module for traditional +superuser logic or any other module that they desire. This change +also allows the developers of the capabilities logic to maintain and +enhance their code more freely, without needing to integrate patches +back into the base kernel. + + + +In addition to moving the capabilities logic, the LSM kernel patch +could move the capability-related fields from the kernel data +structures into the new security fields managed by the security +modules. However, at present, the LSM kernel patch leaves the +capability fields in the kernel data structures. In his original +remarks, Linus suggested that this might be preferable so that other +security modules can be easily stacked with the capabilities module +without needing to chain multiple security structures on the security field. +It also avoids imposing extra overhead on the capabilities module +to manage the security fields. However, the LSM framework could +certainly support such a move if it is determined to be desirable, +with only a few additional changes described below. + + + +At present, the capabilities logic for computing process capabilities +on execve and set*uid, +checking capabilities for a particular process, saving and checking +capabilities for netlink messages, and handling the +capget and capset system +calls have been moved into the capabilities module. There are still a +few locations in the base kernel where capability-related fields are +directly examined or modified, but the current version of the LSM +patch does allow a security module to completely replace the +assignment and testing of capabilities. 
These few locations would +need to be changed if the capability-related fields were moved into +the security field. The following is a list of known locations that +still perform such direct examination or modification of +capability-related fields: + +fs/open.c:sys_access +fs/lockd/host.c:nlm_bind_host +fs/nfsd/auth.c:nfsd_setuser +fs/proc/array.c:task_cap + + + + + +
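The registration mechanism described in the LSM Framework chapter (a global hook table that starts out pointing at a dummy module, can be claimed by one primary module, and reverts to the dummy hooks on unregistration) can be pictured with a small user-space model. This is only a sketch under stated assumptions: every name prefixed with toy_ or sample_ is hypothetical, the table holds a single hook instead of the real collection, and the return values are illustrative rather than the kernel's actual error codes.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for the security_operations table:
 * one hook that decides whether a uid may perform a privileged action.
 * A hook returns 0 to allow and -1 to deny. */
struct toy_security_ops {
	int (*capable)(int uid);
};

/* Dummy module: traditional superuser logic, only uid 0 is privileged. */
int toy_dummy_capable(int uid)
{
	return uid == 0 ? 0 : -1;
}

/* Sample module (hypothetical policy): uid 1000 is privileged as well. */
int sample_capable(int uid)
{
	return (uid == 0 || uid == 1000) ? 0 : -1;
}

struct toy_security_ops toy_dummy_ops = { toy_dummy_capable };
struct toy_security_ops sample_ops = { sample_capable };

/* The global table pointer, initialized to the dummy module's hooks. */
struct toy_security_ops *toy_security_ops = &toy_dummy_ops;

/* Install ops as the primary module; fails if one is already installed. */
int toy_register_security(struct toy_security_ops *ops)
{
	if (ops == NULL || ops->capable == NULL)
		return -1;
	if (toy_security_ops != &toy_dummy_ops)
		return -1;
	toy_security_ops = ops;
	return 0;
}

/* Revert to the dummy hooks; only the current primary module may do so. */
int toy_unregister_security(struct toy_security_ops *ops)
{
	if (ops != toy_security_ops)
		return -1;
	toy_security_ops = &toy_dummy_ops;
	return 0;
}

/* What a hook call site does: always dispatch through the global table. */
int toy_capable(int uid)
{
	return toy_security_ops->capable(uid);
}
```

In this model a caller never knows which module answered; it only sees the result of toy_capable(), which mirrors the way the hook call sites dispatch through the global table.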
diff --git a/Documentation/DocBook/man/Makefile b/Documentation/DocBook/man/Makefile new file mode 100644 index 000000000000..4fb7ea0f7ac8 --- /dev/null +++ b/Documentation/DocBook/man/Makefile @@ -0,0 +1,3 @@ +# Rules are put in Documentation/DocBook + +clean-files := *.9.gz *.sgml manpage.links manpage.refs diff --git a/Documentation/DocBook/mcabook.tmpl b/Documentation/DocBook/mcabook.tmpl new file mode 100644 index 000000000000..4367f4642f3d --- /dev/null +++ b/Documentation/DocBook/mcabook.tmpl @@ -0,0 +1,107 @@ + + + + + + MCA Driver Programming Interface + + + + Alan + Cox + +
+ alan@redhat.com +
+
+
+ + David + Weinehall + + + Chris + Beauregard + +
+ + + 2000 + Alan Cox + David Weinehall + Chris Beauregard + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + +
+ + + + + Introduction + + The MCA bus functions provide a generalised interface to find MCA + bus cards, to claim them for a driver, and to read and manipulate POS + registers without being aware of the motherboard internals or + certain deep magic specific to onboard devices. + + + The basic interface to the MCA bus devices is the slot. Each slot + is numbered and virtual slot numbers are assigned to the internal + devices. Using a pci_dev as other buses do does not really make + sense in the MCA context as the MCA bus resources require card + specific interpretation. + + + Finally the MCA bus functions provide a parallel set of DMA + functions mimicking the ISA bus DMA functions as closely as possible, + while also supporting the additional DMA functionality on the + MCA bus controllers. + + + + Known Bugs And Assumptions + + None. + + + + + Public Functions Provided +!Earch/i386/kernel/mca.c + + + + DMA Functions Provided +!Iinclude/asm-i386/mca_dma.h + + +
diff --git a/Documentation/DocBook/mtdnand.tmpl b/Documentation/DocBook/mtdnand.tmpl new file mode 100644 index 000000000000..6e463d0db266 --- /dev/null +++ b/Documentation/DocBook/mtdnand.tmpl @@ -0,0 +1,1320 @@ + + + + + + MTD NAND Driver Programming Interface + + + + Thomas + Gleixner + +
+ tglx@linutronix.de +
+
+
+
+ + + 2004 + Thomas Gleixner + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License version 2 as published by the Free Software Foundation. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + +
+ + + + + Introduction + + The generic NAND driver supports almost all NAND and AG-AND based + chips and connects them to the Memory Technology Devices (MTD) + subsystem of the Linux Kernel. + + + This documentation is provided for developers who want to implement + board drivers or filesystem drivers suitable for NAND devices. + + + + + Known Bugs And Assumptions + + None. + + + + + Documentation hints + + The function and structure docs are autogenerated. Each function and + struct member has a short description which is marked with an [XXX] identifier. + The following chapters explain the meaning of those identifiers. + + + Function identifiers [XXX] + + The functions are marked with [XXX] identifiers in the short + comment. The identifiers explain the usage and scope of the + functions. The following identifiers are used: + + + + [MTD Interface] + These functions provide the interface to the MTD kernel API. + They are not replaceable and provide functionality + which is completely hardware independent. + + + [NAND Interface] + These functions are exported and provide the interface to the NAND kernel API. + + + [GENERIC] + Generic functions are not replaceable and provide functionality + which is completely hardware independent. + + + [DEFAULT] + Default functions provide hardware related functionality which is suitable + for most of the implementations. These functions can be replaced by the + board driver if necessary. Those functions are called via pointers in the + NAND chip description structure. The board driver can set the functions which + should be replaced by board dependent functions before calling nand_scan(). + If the function pointer is NULL on entry to nand_scan() then the pointer + is set to the default function which is suitable for the detected chip type. + + + + + Struct member identifiers [XXX] + + The struct members are marked with [XXX] identifiers in the + comment. The identifiers explain the usage and scope of the + members.
The following identifiers are used: + + + + [INTERN] + These members are for NAND driver internal use only and must not be + modified. Most of these values are calculated from the chip geometry + information which is evaluated during nand_scan(). + + + [REPLACEABLE] + Replaceable members hold hardware related functions which can be + provided by the board driver. The board driver can set the functions which + should be replaced by board dependent functions before calling nand_scan(). + If the function pointer is NULL on entry to nand_scan() then the pointer + is set to the default function which is suitable for the detected chip type. + + + [BOARDSPECIFIC] + Board specific members hold hardware related information which must + be provided by the board driver. The board driver must set the function + pointers and data fields before calling nand_scan(). + + + [OPTIONAL] + Optional members can hold information relevant for the board driver. The + generic NAND driver code does not use this information. + + + + + + + Basic board driver + + For most boards it will be sufficient to provide just the + basic functions and fill out some really board dependent + members in the nand chip description structure. + See drivers/mtd/nand/skeleton for reference. + + + Basic defines + + You have to provide at least an mtd structure and + storage for the ioremap'ed chip address. + You can allocate the mtd structure using kmalloc + or you can allocate it statically. + In case of static allocation you have to allocate + a nand_chip structure too. + + + Kmalloc based example + + +static struct mtd_info *board_mtd; +static unsigned long baseaddr; + + + Static example + + +static struct mtd_info board_mtd; +static struct nand_chip board_chip; +static unsigned long baseaddr; + + + + Partition defines + + If you want to divide your device into partitions, then + enable the configuration switch CONFIG_MTD_PARTITIONS and define + a partitioning scheme suitable to your board.
+ + +#define NUM_PARTITIONS 2 +static struct mtd_partition partition_info[] = { + { .name = "Flash partition 1", + .offset = 0, + .size = 8 * 1024 * 1024 }, + { .name = "Flash partition 2", + .offset = MTDPART_OFS_NEXT, + .size = MTDPART_SIZ_FULL }, +}; + + + + Hardware control function + + The hardware control function provides access to the + control pins of the NAND chip(s). + The access can be done by GPIO pins or by address lines. + If you use address lines, make sure that the timing + requirements are met. + + + GPIO based example + + +static void board_hwcontrol(struct mtd_info *mtd, int cmd) +{ + switch(cmd){ + case NAND_CTL_SETCLE: /* Set CLE pin high */ break; + case NAND_CTL_CLRCLE: /* Set CLE pin low */ break; + case NAND_CTL_SETALE: /* Set ALE pin high */ break; + case NAND_CTL_CLRALE: /* Set ALE pin low */ break; + case NAND_CTL_SETNCE: /* Set nCE pin low */ break; + case NAND_CTL_CLRNCE: /* Set nCE pin high */ break; + } +} + + + Address lines based example. It's assumed that the + nCE pin is driven by a chip select decoder. + + +static void board_hwcontrol(struct mtd_info *mtd, int cmd) +{ + struct nand_chip *this = (struct nand_chip *) mtd->priv; + switch(cmd){ + case NAND_CTL_SETCLE: this->IO_ADDR_W |= CLE_ADRR_BIT; break; + case NAND_CTL_CLRCLE: this->IO_ADDR_W &= ~CLE_ADRR_BIT; break; + case NAND_CTL_SETALE: this->IO_ADDR_W |= ALE_ADRR_BIT; break; + case NAND_CTL_CLRALE: this->IO_ADDR_W &= ~ALE_ADRR_BIT; break; + } +} + + + + Device ready function + + If the hardware interface has the ready busy pin of the NAND chip connected to a + GPIO or other accessible I/O pin, this function is used to read back the state of the + pin. The function has no arguments and should return 0 if the device is busy (R/B pin + is low) and 1 if the device is ready (R/B pin is high). + If the hardware interface does not give access to the ready busy pin, then + the function must not be defined and the function pointer this->dev_ready is set to NULL.
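As an illustration of the device ready function described above, the following is a minimal user-space sketch, not driver code. The variable board_gpio_in and the bit BOARD_NAND_RB_BIT are hypothetical stand-ins for whatever GPIO input register your board provides, and the prototype simply follows the description in this section (no arguments). Check the dev_ready member of the nand_chip structure in your kernel tree for the exact prototype before assigning the function.

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped GPIO input register.
 * A real board driver would read the register through its I/O accessors. */
volatile uint32_t board_gpio_in;

/* Hypothetical bit position of the R/B pin inside that register. */
#define BOARD_NAND_RB_BIT (1u << 3)

/* Return 0 while the chip is busy (R/B low), 1 when it is ready (R/B high). */
int board_dev_ready(void)
{
	return (board_gpio_in & BOARD_NAND_RB_BIT) ? 1 : 0;
}
```

The function normalizes the masked register value to exactly 0 or 1, so callers do not depend on where the pin sits in the register.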
+ + + + Init function + + The init function allocates memory and sets up all the board + specific parameters and function pointers. When everything + is set up nand_scan() is called. This function tries to + detect and identify the chip. If a chip is found all the + internal data fields are initialized accordingly. + The structure(s) have to be zeroed out first and then filled with the necessary + information about the device. + + +int __init board_init (void) +{ + struct nand_chip *this; + int err = 0; + + /* Allocate memory for MTD device structure and private data */ + board_mtd = kmalloc (sizeof(struct mtd_info) + sizeof (struct nand_chip), GFP_KERNEL); + if (!board_mtd) { + printk ("Unable to allocate NAND MTD device structure.\n"); + err = -ENOMEM; + goto out; + } + + /* Initialize structures */ + memset ((char *) board_mtd, 0, sizeof(struct mtd_info) + sizeof(struct nand_chip)); + + /* map physical address */ + baseaddr = (unsigned long)ioremap(CHIP_PHYSICAL_ADDRESS, 1024); + if(!baseaddr){ + printk("Ioremap to access NAND chip failed\n"); + err = -EIO; + goto out_mtd; + } + + /* Get pointer to private data */ + this = (struct nand_chip *) (&board_mtd[1]); + /* Link the private data with the MTD structure */ + board_mtd->priv = this; + + /* Set address of NAND IO lines */ + this->IO_ADDR_R = baseaddr; + this->IO_ADDR_W = baseaddr; + /* Reference hardware control function */ + this->hwcontrol = board_hwcontrol; + /* Set command delay time, see datasheet for correct value */ + this->chip_delay = CHIP_DEPENDEND_COMMAND_DELAY; + /* Assign the device ready function, if available */ + this->dev_ready = board_dev_ready; + this->eccmode = NAND_ECC_SOFT; + + /* Scan to find existence of the device */ + if (nand_scan (board_mtd, 1)) { + err = -ENXIO; + goto out_ior; + } + + add_mtd_partitions(board_mtd, partition_info, NUM_PARTITIONS); + goto out; + +out_ior: + iounmap((void *)baseaddr); +out_mtd: + kfree (board_mtd); +out: + return err; +} +module_init(board_init); + + + + Exit 
function + + The exit function is only necessary if the driver is + compiled as a module. It releases all resources which + are held by the chip driver and unregisters the partitions + in the MTD layer. + + +#ifdef MODULE +static void __exit board_cleanup (void) +{ + /* Release resources, unregister device */ + nand_release (board_mtd); + + /* unmap physical address */ + iounmap((void *)baseaddr); + + /* Free the MTD device structure */ + kfree (board_mtd); +} +module_exit(board_cleanup); +#endif + + + + + + Advanced board driver functions + + This chapter describes the advanced functionality of the NAND + driver. For a list of functions which can be overridden by the board + driver see the documentation of the nand_chip structure. + + + Multiple chip control + + The nand driver can control chip arrays. Therefore the + board driver must provide its own select_chip function. This + function must (de)select the requested chip. + The function pointer in the nand_chip structure must + be set before calling nand_scan(). The maxchip parameter + of nand_scan() defines the maximum number of chips to + scan for. Make sure that the select_chip function can + handle the requested number of chips. + + + The nand driver concatenates the chips to one virtual + chip and provides this virtual chip to the MTD layer. + + + Note: The driver can only handle linear chip arrays + of equally sized chips. There is no support for + parallel arrays which extend the buswidth. + + + GPIO based example + + +static void board_select_chip (struct mtd_info *mtd, int chip) +{ + /* Deselect all chips, set all nCE pins high */ + GPIO(BOARD_NAND_NCE) |= 0xff; + if (chip >= 0) + GPIO(BOARD_NAND_NCE) &= ~ (1 << chip); +} + + + Address lines based example. + It's assumed that the nCE pins are connected to an + address decoder.
+ + +static void board_select_chip (struct mtd_info *mtd, int chip) +{ + struct nand_chip *this = (struct nand_chip *) mtd->priv; + + /* Deselect all chips */ + this->IO_ADDR_R &= ~BOARD_NAND_ADDR_MASK; + this->IO_ADDR_W &= ~BOARD_NAND_ADDR_MASK; + switch (chip) { + case 0: + this->IO_ADDR_R |= BOARD_NAND_ADDR_CHIP0; + this->IO_ADDR_W |= BOARD_NAND_ADDR_CHIP0; + break; + .... + case n: + this->IO_ADDR_R |= BOARD_NAND_ADDR_CHIPn; + this->IO_ADDR_W |= BOARD_NAND_ADDR_CHIPn; + break; + } +} + + + + Hardware ECC support + + Functions and constants + + The nand driver supports four different types of + hardware ECC. + + NAND_ECC_HW3_256 + Hardware ECC generator providing 3 bytes ECC per + 256 byte. + + NAND_ECC_HW3_512 + Hardware ECC generator providing 3 bytes ECC per + 512 byte. + + NAND_ECC_HW6_512 + Hardware ECC generator providing 6 bytes ECC per + 512 byte. + + NAND_ECC_HW8_512 + Hardware ECC generator providing 8 bytes ECC per + 512 byte. + + + If your hardware generator has a different functionality + add it at the appropriate place in nand_base.c + + + The board driver must provide the following functions: + + enable_hwecc + This function is called before reading / writing to + the chip. Reset or initialize the hardware generator + in this function. The function is called with an + argument which lets you distinguish between read + and write operations. + + calculate_ecc + This function is called after read / write from / to + the chip. Transfer the ECC from the hardware to + the buffer. If the option NAND_HWECC_SYNDROME is set + then the function is only called on write. See below. + + correct_data + In case of an ECC error this function is called for + error detection and correction. Return 1 respectively 2 + in case the error can be corrected. If the error is + not correctable return -1.
If your hardware generator + matches the default algorithm of the nand_ecc software + generator then use the correction function provided + by nand_ecc instead of implementing duplicated code. + + + + + + Hardware ECC with syndrome calculation + + Many hardware ECC implementations provide Reed-Solomon + codes and calculate an error syndrome on read. The syndrome + must be converted to a standard Reed-Solomon syndrome + before calling the error correction code in the generic + Reed-Solomon library. + + + The ECC bytes must be placed immediately after the data + bytes in order to make the syndrome generator work. This + is contrary to the usual layout used by software ECC. The + separation of data and out of band area is no longer + possible. The nand driver code handles this layout and + the remaining free bytes in the oob area are managed by + the autoplacement code. Provide a matching oob-layout + in this case. See rts_from4.c and diskonchip.c for + implementation reference. In those cases we must also + use bad block tables on FLASH, because the ECC layout is + interfering with the bad block marker positions. + See bad block table support for details. + + + + + Bad block table support + + Most NAND chips mark the bad blocks at a defined + position in the spare area. Those blocks must + not be erased under any circumstances as the bad + block information would be lost. + It is possible to check the bad block mark each + time when the blocks are accessed by reading the + spare area of the first page in the block. This + is time consuming so a bad block table is used. + + + The nand driver supports various types of bad block + tables. + + Per device + The bad block table contains all bad block information + of the device which can consist of multiple chips. + + Per chip + A bad block table is used per chip and contains the + bad block information for this particular chip. + + Fixed offset + The bad block table is located at a fixed offset + in the chip (device).
This applies to various + DiskOnChip devices. + + Automatically placed + The bad block table is automatically placed and + detected either at the end or at the beginning + of a chip (device) + + Mirrored tables + The bad block table is mirrored on the chip (device) to + allow updates of the bad block table without data loss. + + + + + nand_scan() calls the function nand_default_bbt(). + nand_default_bbt() selects appropriate default + bad block table descriptors depending on the chip information + which was retrieved by nand_scan(). + + + The standard policy is scanning the device for bad + blocks and building a RAM based bad block table which + allows faster access than always checking the + bad block information on the flash chip itself. + + + Flash based tables + + It may be desired or necessary to keep a bad block table in FLASH. + For AG-AND chips this is mandatory, as they have no factory marked + bad blocks. They have factory marked good blocks. The marker pattern + is erased when the block is erased to be reused. So in case of + powerloss before writing the pattern back to the chip this block + would be lost and added to the bad blocks. Therefore we scan the + chip(s) when we detect them the first time for good blocks and + store this information in a bad block table before erasing any + of the blocks. + + + The blocks in which the tables are stored are protected against + accidental access by marking them bad in the memory bad block + table. The bad block table management functions are allowed + to circumvent this protection. + + + The simplest way to activate the FLASH based bad block table support + is to set the option NAND_USE_FLASH_BBT in the option field of + the nand chip structure before calling nand_scan(). For AG-AND + chips this is done by default. + This activates the default FLASH based bad block table functionality + of the NAND driver.
The default bad block table options are + + Store bad block table per chip + Use 2 bits per block + Automatic placement at the end of the chip + Use mirrored tables with version numbers + Reserve 4 blocks at the end of the chip + + + + + User defined tables + + User defined tables are created by filling out a + nand_bbt_descr structure and storing the pointer in the + nand_chip structure member bbt_td before calling nand_scan(). + If a mirror table is necessary a second structure must be + created and a pointer to this structure must be stored + in bbt_md inside the nand_chip structure. If the bbt_md + member is set to NULL then only the main table is used + and no scan for the mirrored table is performed. + + + The most important field in the nand_bbt_descr structure + is the options field. The options define most of the + table properties. Use the predefined constants from + nand.h to define the options. + + Number of bits per block + The supported number of bits is 1, 2, 4, 8. + Table per chip + Setting the constant NAND_BBT_PERCHIP selects that + a bad block table is managed for each chip in a chip array. + If this option is not set then a per device bad block table + is used. + Table location is absolute + Use the option constant NAND_BBT_ABSPAGE and + define the absolute page number where the bad block + table starts in the field pages. If you have selected bad block + tables per chip and you have a multi chip array then the start page + must be given for each chip in the chip array. Note: there is no scan + for a table ident pattern performed, so the fields + pattern, veroffs, offs, len can be left uninitialized. + Table location is automatically detected + The table can either be located in the first or the last good + blocks of the chip (device). Set NAND_BBT_LASTBLOCK to place + the bad block table at the end of the chip (device).
The + bad block tables are marked and identified by a pattern which + is stored in the spare area of the first page in the block which + holds the bad block table. Store a pointer to the pattern + in the pattern field. Further the length of the pattern has to be + stored in len and the offset in the spare area must be given + in the offs member of the nand_bbt_descr structure. For mirrored + bad block tables different patterns are mandatory. + Table creation + Set the option NAND_BBT_CREATE to enable the table creation + if no table can be found during the scan. Usually this is done only + once if a new chip is found. + Table write support + Set the option NAND_BBT_WRITE to enable the table write support. + This allows the update of the bad block table(s) in case a block has + to be marked bad due to wear. The MTD interface function block_markbad + is calling the update function of the bad block table. If the write + support is enabled then the table is updated on FLASH. + + Note: Write support should only be enabled for mirrored tables with + version control. + + Table version control + Set the option NAND_BBT_VERSION to enable the table version control. + It's highly recommended to enable this for mirrored tables with write + support. It makes sure that the risk of losing the bad block + table information is reduced to the loss of the information about the + one worn out block which should be marked bad. The version is stored in + 4 consecutive bytes in the spare area of the device. The position of + the version number is defined by the member veroffs in the bad block table + descriptor. + Save block contents on write + + In case that the block which holds the bad block table does contain + other useful information, set the option NAND_BBT_SAVECONTENT. When + the bad block table is written then the whole block is read, the bad + block table is updated, the block is erased and everything is + written back.
If this option is not set, only the bad block table + is written and everything else in the block is ignored and erased. + + Number of reserved blocks + + For automatic placement some blocks must be reserved for + bad block table storage. The number of reserved blocks is defined + in the maxblocks member of the bad block table description structure. + Reserving 4 blocks for mirrored tables should be a reasonable number. + This also limits the number of blocks which are scanned for the bad + block table ident pattern. + + + + + + + Spare area (auto)placement + + The nand driver implements different possibilities for + placement of filesystem data in the spare area: + + Placement defined by fs driver + Automatic placement + + The default placement function is automatic placement. The + nand driver has built in default placement schemes for the + various chiptypes. If due to hardware ECC functionality the + default placement does not fit then the board driver can + provide its own placement scheme. + + + File system drivers can provide their own placement scheme which + is used instead of the default placement scheme. + + + Placement schemes are defined by a nand_oobinfo structure: + +struct nand_oobinfo { + int useecc; + int eccbytes; + int eccpos[24]; + int oobfree[8][2]; +}; + + + useecc + The useecc member controls the ecc and placement function. The header + file include/mtd/mtd-abi.h contains constants to select ecc and + placement. MTD_NANDECC_OFF switches off the ecc completely. This is + not recommended and is available for testing and diagnosis only. + MTD_NANDECC_PLACE selects caller defined placement, MTD_NANDECC_AUTOPLACE + selects automatic placement. + + eccbytes + The eccbytes member defines the number of ecc bytes per page. + + eccpos + The eccpos array holds the byte offsets in the spare area where + the ecc codes are placed. + + oobfree + The oobfree array defines the areas in the spare area which can be + used for automatic placement.
The information is given in the format + {offset, size}. offset defines the start of the usable area, size the + length in bytes. More than one area can be defined. The list is terminated + by a {0, 0} entry. + + + + + Placement defined by fs driver + + The calling function provides a pointer to a nand_oobinfo + structure which defines the ecc placement. For writes the + caller must provide a spare area buffer along with the + data buffer. The spare area buffer size is (number of pages) * + (size of spare area). For reads the buffer size is + (number of pages) * ((size of spare area) + (number of ecc + steps per page) * sizeof (int)). The driver stores the + result of the ecc check for each tuple in the spare buffer. + The storage sequence is + + + <spare data page 0><ecc result 0>...<ecc result n> + + + ... + + + <spare data page n><ecc result 0>...<ecc result n> + + + This is a legacy mode used by YAFFS1. + + + If the spare area buffer is NULL then only the ECC placement is + done according to the given scheme in the nand_oobinfo structure. + + + + Automatic placement + + Automatic placement uses the built in defaults to place the + ecc bytes in the spare area. If filesystem data have to be stored in or + read from the spare area then the calling function must provide a + buffer. The buffer size per page is determined by the oobfree array in + the nand_oobinfo structure. + + + If the spare area buffer is NULL then only the ECC placement is + done according to the default builtin scheme. + + + + User space placement selection + + All non ecc functions like mtd->read and mtd->write use an internal + structure, which can be set by an ioctl. This structure is preset + to the autoplacement default. + + ioctl (fd, MEMSETOOBSEL, oobsel); + + oobsel is a pointer to a user supplied structure of type + nand_oobinfo. The contents of this structure must match the + requirements of the filesystem that will be used. See an example in utils/nandwrite.c.
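The placement rules above can be illustrated with a small user space sketch. It is not driver code: the structure is a local copy of the nand_oobinfo layout shown in this chapter, EXAMPLE_AUTOPLACE is a hypothetical stand-in for MTD_NANDECC_AUTOPLACE (the real constant lives in mtd-abi.h), and the helper names are made up for illustration. The scheme filled in is the 512 byte pagesize layout listed in this document (ECC bytes at 0x00-0x03, 0x06 and 0x07, free area at 0x08-0x0F).

```c
#include <assert.h>
#include <stddef.h>

/* Local copy of the structure from this chapter, so the sketch builds in user space. */
struct nand_oobinfo {
	int useecc;
	int eccbytes;
	int eccpos[24];
	int oobfree[8][2];
};

/* Hypothetical stand-in; use MTD_NANDECC_AUTOPLACE from mtd-abi.h in real code. */
#define EXAMPLE_AUTOPLACE 2

/* 512 byte pagesize scheme as listed in the default scheme table. */
const struct nand_oobinfo example_oob_16 = {
	.useecc = EXAMPLE_AUTOPLACE,
	.eccbytes = 6,
	.eccpos = {0, 1, 2, 3, 6, 7},
	.oobfree = { {8, 8} },	/* remaining entries are {0, 0}, which ends the list */
};

/* Sum the sizes of all {offset, size} areas up to the {0, 0} terminator. */
int oob_free_bytes(const struct nand_oobinfo *oob)
{
	int i, total = 0;

	assert(oob != NULL);
	for (i = 0; i < 8; i++) {
		if (oob->oobfree[i][0] == 0 && oob->oobfree[i][1] == 0)
			break;
		total += oob->oobfree[i][1];
	}
	return total;
}

/* Spare buffer size for writes: (number of pages) * (size of spare area). */
size_t oob_write_buf_size(size_t pages, size_t sparesize)
{
	return pages * sparesize;
}

/* Spare buffer size for reads: pages * (spare size + ecc steps * sizeof(int)). */
size_t oob_read_buf_size(size_t pages, size_t sparesize, size_t eccsteps)
{
	return pages * (sparesize + eccsteps * sizeof(int));
}
```

With this scheme oob_free_bytes(&example_oob_16) yields the 8 autoplace bytes per page. The read buffer is larger than the write buffer because the driver appends one int sized ecc result per ecc step after the spare data of each page.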
+ + + + + Spare area autoplacement default schemes + + 256 byte pagesize + + +Offset +Content +Comment + + +0x00 +ECC byte 0 +Error correction code byte 0 + + +0x01 +ECC byte 1 +Error correction code byte 1 + + +0x02 +ECC byte 2 +Error correction code byte 2 + + +0x03 +Autoplace 0 + + + +0x04 +Autoplace 1 + + + +0x05 +Bad block marker +If any bit in this byte is zero, then this block is bad. +This applies only to the first page in a block. In the remaining +pages this byte is reserved + + +0x06 +Autoplace 2 + + + +0x07 +Autoplace 3 + + + + + + 512 byte pagesize + + +Offset +Content +Comment + + +0x00 +ECC byte 0 +Error correction code byte 0 of the lower 256 Byte data in +this page + + +0x01 +ECC byte 1 +Error correction code byte 1 of the lower 256 Bytes of data +in this page + + +0x02 +ECC byte 2 +Error correction code byte 2 of the lower 256 Bytes of data +in this page + + +0x03 +ECC byte 3 +Error correction code byte 0 of the upper 256 Bytes of data +in this page + + +0x04 +reserved +reserved + + +0x05 +Bad block marker +If any bit in this byte is zero, then this block is bad. +This applies only to the first page in a block. In the remaining +pages this byte is reserved + + +0x06 +ECC byte 4 +Error correction code byte 1 of the upper 256 Bytes of data +in this page + + +0x07 +ECC byte 5 +Error correction code byte 2 of the upper 256 Bytes of data +in this page + + +0x08 - 0x0F +Autoplace 0 - 7 + + + + + + 2048 byte pagesize + + +Offset +Content +Comment + + +0x00 +Bad block marker +If any bit in this byte is zero, then this block is bad. +This applies only to the first page in a block. 
In the remaining +pages this byte is reserved. + + +0x01 +Reserved +Reserved + + +0x02-0x27 +Autoplace 0 - 37 + + + +0x28 +ECC byte 0 +Error correction code byte 0 of the first 256 Bytes of data in +this page + + +0x29 +ECC byte 1 +Error correction code byte 1 of the first 256 Bytes of data +in this page + + +0x2A +ECC byte 2 +Error correction code byte 2 of the first 256 Bytes of data in +this page + + +0x2B +ECC byte 3 +Error correction code byte 0 of the second 256 Bytes of data +in this page + + +0x2C +ECC byte 4 +Error correction code byte 1 of the second 256 Bytes of data +in this page + + +0x2D +ECC byte 5 +Error correction code byte 2 of the second 256 Bytes of data +in this page + + +0x2E +ECC byte 6 +Error correction code byte 0 of the third 256 Bytes of data +in this page + + +0x2F +ECC byte 7 +Error correction code byte 1 of the third 256 Bytes of data +in this page + + +0x30 +ECC byte 8 +Error correction code byte 2 of the third 256 Bytes of data +in this page + + +0x31 +ECC byte 9 +Error correction code byte 0 of the fourth 256 Bytes of data +in this page + + +0x32 +ECC byte 10 +Error correction code byte 1 of the fourth 256 Bytes of data +in this page + + +0x33 +ECC byte 11 +Error correction code byte 2 of the fourth 256 Bytes of data +in this page + + +0x34 +ECC byte 12 +Error correction code byte 0 of the fifth 256 Bytes of data +in this page + + +0x35 +ECC byte 13 +Error correction code byte 1 of the fifth 256 Bytes of data +in this page + + +0x36 +ECC byte 14 +Error correction code byte 2 of the fifth 256 Bytes of data +in this page + + +0x37 +ECC byte 15 +Error correction code byte 0 of the sixth 256 Bytes of data +in this page + + +0x38 +ECC byte 16 +Error correction code byte 1 of the sixth 256 Bytes of data +in this page + + +0x39 +ECC byte 17 +Error correction code byte 2 of the sixth 256 Bytes of data +in this page + + +0x3A +ECC byte 18 +Error correction code byte 0 of the seventh 256 Bytes of +data in this page + + +0x3B +ECC byte 19 +Error
correction code byte 1 of the seventh 256 Bytes of +data in this page + + +0x3C +ECC byte 20 +Error correction code byte 2 of the seventh 256 Bytes of +data in this page + + +0x3D +ECC byte 21 +Error correction code byte 0 of the eighth 256 Bytes of data +in this page + + +0x3E +ECC byte 22 +Error correction code byte 1 of the eighth 256 Bytes of data +in this page + + +0x3F +ECC byte 23 +Error correction code byte 2 of the eighth 256 Bytes of data +in this page + + + + + + + + Filesystem support + + The NAND driver provides all necessary functions for a + filesystem via the MTD interface. + + + Filesystems must be aware of the NAND peculiarities and + restrictions. One major restriction of NAND Flash is that you cannot + write to a page as often as you want. The number of consecutive writes to a page, + before erasing it again, is restricted to 1-3, depending on the + manufacturer's specifications. This applies similarly to the spare area. + + + Therefore, NAND aware filesystems must either write in page size chunks + or hold a writebuffer to collect smaller writes until they sum up to + pagesize. Available NAND aware filesystems: JFFS2, YAFFS. + + + The spare area usage to store filesystem data is controlled by + the spare area placement functionality which is described in one + of the earlier chapters. + + + + Tools + + The MTD project provides a couple of helpful tools to handle NAND Flash. + + flasherase, flasheraseall: Erase and format FLASH partitions + nandwrite: write filesystem images to NAND FLASH + nanddump: dump the contents of NAND FLASH partitions + + + + These tools are aware of the NAND restrictions. Please use those tools + instead of complaining about errors which are caused by non NAND aware + access methods. + + + + + Constants + + This chapter describes the constants which might be relevant for a driver developer. + + + Chip option constants + + Constants for chip id table + + These constants are defined in nand.h.
They are ORed together to describe + the chip functionality. + +/* Chip can not auto increment pages */ +#define NAND_NO_AUTOINCR 0x00000001 +/* Buswidth is 16 bit */ +#define NAND_BUSWIDTH_16 0x00000002 +/* Device supports partial programming without padding */ +#define NAND_NO_PADDING 0x00000004 +/* Chip has cache program function */ +#define NAND_CACHEPRG 0x00000008 +/* Chip has copy back function */ +#define NAND_COPYBACK 0x00000010 +/* AND Chip which has 4 banks and a confusing page / block + * assignment. See Renesas datasheet for further information */ +#define NAND_IS_AND 0x00000020 +/* Chip has an array of 4 pages which can be read without + * additional ready / busy waits */ +#define NAND_4PAGE_ARRAY 0x00000040 + + + + + Constants for runtime options + + These constants are defined in nand.h. They are ORed together to describe + the functionality. + +/* Use a flash based bad block table. This option is parsed by the + * default bad block table function (nand_default_bbt). */ +#define NAND_USE_FLASH_BBT 0x00010000 +/* The hw ecc generator provides a syndrome instead of an ecc value on read. + * This can only work if we have the ecc bytes directly behind the + * data bytes. Applies for DOC and AG-AND Renesas HW Reed Solomon generators */ +#define NAND_HWECC_SYNDROME 0x00020000 + + + + + + + ECC selection constants + + Use these constants to select the ECC algorithm. + +/* No ECC. Usage is not recommended !
*/ +#define NAND_ECC_NONE 0 +/* Software ECC 3 byte ECC per 256 Byte data */ +#define NAND_ECC_SOFT 1 +/* Hardware ECC 3 byte ECC per 256 Byte data */ +#define NAND_ECC_HW3_256 2 +/* Hardware ECC 3 byte ECC per 512 Byte data */ +#define NAND_ECC_HW3_512 3 +/* Hardware ECC 6 byte ECC per 512 Byte data */ +#define NAND_ECC_HW6_512 4 +/* Hardware ECC 8 byte ECC per 512 Byte data */ +#define NAND_ECC_HW8_512 6 + + + + + + Hardware control related constants + + These constants describe the requested hardware access function when + the board specific hardware control function is called. + +/* Select the chip by setting nCE to low */ +#define NAND_CTL_SETNCE 1 +/* Deselect the chip by setting nCE to high */ +#define NAND_CTL_CLRNCE 2 +/* Select the command latch by setting CLE to high */ +#define NAND_CTL_SETCLE 3 +/* Deselect the command latch by setting CLE to low */ +#define NAND_CTL_CLRCLE 4 +/* Select the address latch by setting ALE to high */ +#define NAND_CTL_SETALE 5 +/* Deselect the address latch by setting ALE to low */ +#define NAND_CTL_CLRALE 6 +/* Set write protection by setting WP to high. Not used! */ +#define NAND_CTL_SETWP 7 +/* Clear write protection by setting WP to low. Not used! */ +#define NAND_CTL_CLRWP 8 + + + + + + Bad block table related constants + + These constants describe the options used for bad block + table descriptors.
+ +/* Options for the bad block table descriptors */ + +/* The number of bits used per block in the bbt on the device */ +#define NAND_BBT_NRBITS_MSK 0x0000000F +#define NAND_BBT_1BIT 0x00000001 +#define NAND_BBT_2BIT 0x00000002 +#define NAND_BBT_4BIT 0x00000004 +#define NAND_BBT_8BIT 0x00000008 +/* The bad block table is in the last good block of the device */ +#define NAND_BBT_LASTBLOCK 0x00000010 +/* The bbt is at the given page, else we must scan for the bbt */ +#define NAND_BBT_ABSPAGE 0x00000020 +/* The bbt is at the given page, else we must scan for the bbt */ +#define NAND_BBT_SEARCH 0x00000040 +/* bbt is stored per chip on multichip devices */ +#define NAND_BBT_PERCHIP 0x00000080 +/* bbt has a version counter at offset veroffs */ +#define NAND_BBT_VERSION 0x00000100 +/* Create a bbt if none exists */ +#define NAND_BBT_CREATE 0x00000200 +/* Search good / bad pattern through all pages of a block */ +#define NAND_BBT_SCANALLPAGES 0x00000400 +/* Scan block empty during good / bad block scan */ +#define NAND_BBT_SCANEMPTY 0x00000800 +/* Write bbt if necessary */ +#define NAND_BBT_WRITE 0x00001000 +/* Read and write back block contents when writing bbt */ +#define NAND_BBT_SAVECONTENT 0x00002000 + + + + + + + + Structures + + This chapter contains the autogenerated documentation of the structures which are + used in the NAND driver and might be relevant for a driver developer. Each + struct member has a short description which is marked with an [XXX] identifier. + See the chapter "Documentation hints" for an explanation. + +!Iinclude/linux/mtd/nand.h + + + + Public Functions Provided + + This chapter contains the autogenerated documentation of the NAND kernel API functions + which are exported. Each function has a short description which is marked with an [XXX] identifier. + See the chapter "Documentation hints" for an explanation.
+ +!Edrivers/mtd/nand/nand_base.c +!Edrivers/mtd/nand/nand_bbt.c +!Edrivers/mtd/nand/nand_ecc.c + + + + Internal Functions Provided + + This chapter contains the autogenerated documentation of the NAND driver internal functions. + Each function has a short description which is marked with an [XXX] identifier. + See the chapter "Documentation hints" for an explanation. + The functions marked with [DEFAULT] might be relevant for a board driver developer. + +!Idrivers/mtd/nand/nand_base.c +!Idrivers/mtd/nand/nand_bbt.c +!Idrivers/mtd/nand/nand_ecc.c + + + + Credits + + The following people have contributed to the NAND driver: + + Steven J. Hillsjhill@realitydiluted.com + David Woodhousedwmw2@infradead.org + Thomas Gleixnertglx@linutronix.de + + A lot of users have provided bugfixes, improvements and helping hands for testing. + Thanks a lot. + + + The following people have contributed to this document: + + Thomas Gleixnertglx@linutronix.de + + + +
diff --git a/Documentation/DocBook/procfs-guide.tmpl b/Documentation/DocBook/procfs-guide.tmpl new file mode 100644 index 000000000000..45cad23efefa --- /dev/null +++ b/Documentation/DocBook/procfs-guide.tmpl @@ -0,0 +1,591 @@ + + +]> + + + + Linux Kernel Procfs Guide + + + + Erik + (J.A.K.) + Mouw + + Delft University of Technology + Faculty of Information Technology and Systems +
+ J.A.K.Mouw@its.tudelft.nl + PO BOX 5031 + 2600 GA + Delft + The Netherlands +
+
+
+
+ + + + 1.0  + May 30, 2001 + Initial revision posted to linux-kernel + + + 1.1  + June 3, 2001 + Revised after comments from linux-kernel + + + + + 2001 + Erik Mouw + + + + + + This documentation is free software; you can redistribute it + and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This documentation is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR + PURPOSE. See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + +
+ + + + + + + + + + + + Preface + + + This guide describes the use of the procfs file system from + within the Linux kernel. The idea to write this guide came up on + the #kernelnewbies IRC channel (see http://www.kernelnewbies.org/), + when Jeff Garzik explained the use of procfs and forwarded me a + message Alexander Viro wrote to the linux-kernel mailing list. I + agreed to write it up nicely, so here it is. + + + + I'd like to thank Jeff Garzik + jgarzik@pobox.com and Alexander Viro + viro@parcelfarce.linux.theplanet.co.uk for their input, + Tim Waugh twaugh@redhat.com for his Selfdocbook, + and Marc Joosen marcj@historia.et.tudelft.nl for + proofreading. + + + + This documentation was written while working on the LART + computing board (http://www.lart.tudelft.nl/), + which is sponsored by the Mobile Multi-media Communications + (http://www.mmc.tudelft.nl/) + and Ubiquitous Communications (http://www.ubicom.tudelft.nl/) + projects. + + + + Erik + + + + + + + + Introduction + + + The /proc file system + (procfs) is a special file system in the linux kernel. It's a + virtual file system: it is not associated with a block device + but exists only in memory. The files in the procfs are there to + allow userland programs access to certain information from the + kernel (like process information in /proc/[0-9]+/), but also for debug + purposes (like /proc/ksyms). + + + + This guide describes the use of the procfs file system from + within the Linux kernel. It starts by introducing all relevant + functions to manage the files within the file system. After that + it shows how to communicate with userland, and some tips and + tricks will be pointed out. Finally a complete example will be + shown. + + + + Note that the files in /proc/sys are sysctl files: they + don't belong to procfs and are governed by a completely + different API described in the Kernel API book. 
+ + + + + + + + Managing procfs entries + + + This chapter describes the functions that various kernel + components use to populate the procfs with files, symlinks, + device nodes, and directories. + + + + A minor note before we start: if you want to use any of the + procfs functions, be sure to include the correct header file! + This should be one of the first lines in your code: + + + +#include <linux/proc_fs.h> + + + + + + + Creating a regular file + + + + struct proc_dir_entry* create_proc_entry(const char* name, mode_t mode, struct proc_dir_entry* parent); + + + + + This function creates a regular file with the name + name, file mode + mode in the directory + parent. To create a file in the root of + the procfs, use NULL as + parent parameter. When successful, the + function will return a pointer to the freshly created + struct proc_dir_entry; otherwise it + will return NULL. The chapter "Communicating with userland" describes how to do something useful with + regular files. + + + + Note that it is specifically supported that you can pass a + path that spans multiple directories. For example + create_proc_entry("drivers/via0/info") + will create the via0 + directory if necessary, with standard + 0755 permissions. + + + + If you only want to be able to read the file, the function + create_proc_read_entry described in "Convenience functions" may be used to create and initialise + the procfs entry in one single call. + + + + + + + + Creating a symlink + + + + struct proc_dir_entry* proc_symlink(const char* name, struct proc_dir_entry* parent, const char* dest); + + + + + This creates a symlink in the procfs directory + parent that points from + name to + dest. This translates in userland to + ln -s dest + name. + + + + + Creating a directory + + + + struct proc_dir_entry* proc_mkdir(const char* name, struct proc_dir_entry* parent); + + + + + Create a directory name in the procfs + directory parent.
+ + + + + + + + Removing an entry + + + + void remove_proc_entry(const char* name, struct proc_dir_entry* parent); + + + + + Removes the entry name in the directory + parent from the procfs. Entries are + removed by their name, not by the + struct proc_dir_entry returned by the + various create functions. Note that this function doesn't + recursively remove entries. + + + + Be sure to free the data entry from + the struct proc_dir_entry before + remove_proc_entry is called (that is: if + there was some data allocated, of + course). See "A single call back for many files" for more information + on using the data entry. + + + + + + + + + Communicating with userland + + + Instead of reading (or writing) information directly from + kernel memory, procfs works with call back + functions for files: functions that are called when + a specific file is being read or written. Such functions have + to be initialised after the procfs file is created by setting + the read_proc and/or + write_proc fields in the + struct proc_dir_entry* that the + function create_proc_entry returned: + + + +struct proc_dir_entry* entry; + +entry->read_proc = read_proc_foo; +entry->write_proc = write_proc_foo; + + + + If you only want to use the + read_proc, the function + create_proc_read_entry described in "Convenience functions" may be used to create and initialise the + procfs entry in one single call. + + + + + + Reading data + + + The read function is a call back function that allows userland + processes to read data from the kernel. The read function + should have the following format: + + + + + int read_func(char* page, char** start, off_t off, int count, int* eof, void* data); + + + + + The read function should write its information into the + page.
For proper use, the function + should start writing at an offset of + off in page and + write at most count bytes, but because + most read functions are quite simple and only return a small + amount of information, these two parameters are usually + ignored (it breaks pagers like more and + less, but cat still + works). + + + + If the off and + count parameters are properly used, + eof should be used to signal that the + end of the file has been reached by writing + 1 to the memory location + eof points to. + + + + The parameter start doesn't seem to be + used anywhere in the kernel. The data + parameter can be used to create a single call back function for + several files, see "A single call back for many files". + + + + The read_func function must return the + number of bytes written into the page. + + + + The chapter "Example" shows how to use a read call back + function. + + + + + + + + Writing data + + + The write call back function allows a userland process to write + data to the kernel, so it has some kind of control over the + kernel. The write function should have the following format: + + + + + int write_func(struct file* file, const char* buffer, unsigned long count, void* data); + + + + + The write function should read count + bytes at maximum from the buffer. Note + that the buffer doesn't live in the + kernel's memory space, so it should first be copied to kernel + space with copy_from_user. The + file parameter is usually + ignored. "A single call back for many files" shows how to use the + data parameter. + + + + Again, the chapter "Example" shows how to use this call back + function. + + + + + + + + A single call back for many files + + + When a large number of almost identical files is used, it's + quite inconvenient to use a separate call back function for + each file. A better approach is to have a single call back + function that distinguishes between the files by using the + data field in struct + proc_dir_entry.
First of all, the + data field has to be initialised: + + + +struct proc_dir_entry* entry; +struct my_file_data *file_data; + +file_data = kmalloc(sizeof(struct my_file_data), GFP_KERNEL); +entry->data = file_data; + + + + The data field is a void + *, so it can be initialised with anything. + + + + Now that the data field is set, the + read_proc and + write_proc can use it to distinguish + between files because they get it passed into their + data parameter: + + + +int foo_read_func(char *page, char **start, off_t off, + int count, int *eof, void *data) +{ + int len = 0; + + if(data == file_data) { + /* special case for this file */ + } else { + /* normal processing */ + } + + return len; +} + + + + Be sure to free the data field + when removing the procfs entry. + + + + + + + + + Tips and tricks + + + + + + Convenience functions + + + + struct proc_dir_entry* create_proc_read_entry(const char* name, mode_t mode, struct proc_dir_entry* parent, read_proc_t* read_proc, void* data); + + + + + This function creates a regular file in exactly the same way + as create_proc_entry from "Creating a regular file" does, but also allows setting the read + function read_proc in one call. This + function can set the data as well, as + explained in "A single call back for many files". + + + + + + + Modules + + + If procfs is being used from within a module, be sure to set + the owner field in the + struct proc_dir_entry to + THIS_MODULE. + + + +struct proc_dir_entry* entry; + +entry->owner = THIS_MODULE; + + + + + + + + Mode and ownership + + + Sometimes it is useful to change the mode and/or ownership of + a procfs entry. Here is an example that shows how to achieve + that: + + + +struct proc_dir_entry* entry; + +entry->mode = S_IWUSR | S_IRUSR | S_IRGRP | S_IROTH; +entry->uid = 0; +entry->gid = 100; + + + + + + + + + + Example + + + +&procfsexample; + + +
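As a supplement to the example, the off, count and eof handling described in "Reading data" can be sketched in user space. This is only a model of the contract as the guide words it (write at offset off in page, at most count bytes, set eof at the end, return the number of bytes written). It is not kernel code, example_read_func and EXAMPLE_PAGE_SIZE are hypothetical names, and the exact behaviour of the real procfs read path should be checked against the kernel source before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Assumed buffer size for this sketch only; the kernel hands over one page. */
#define EXAMPLE_PAGE_SIZE 4096

/*
 * User space model of a read call back. data points to a NUL terminated
 * string that stands for the complete file content.
 */
int example_read_func(char *page, char **start, off_t off,
		      int count, int *eof, void *data)
{
	const char *text = data;
	size_t total, pos, n;

	(void)start;	/* not used, as noted in the guide */
	assert(page != NULL && eof != NULL && data != NULL);

	/* Never write beyond the one page buffer. */
	total = strlen(text);
	if (total > EXAMPLE_PAGE_SIZE)
		total = EXAMPLE_PAGE_SIZE;

	if (off < 0 || count <= 0)
		return 0;

	pos = (size_t)off;
	if (pos >= total) {
		*eof = 1;	/* nothing left to read */
		return 0;
	}

	/* Copy at most count bytes, starting at offset off in the page. */
	n = total - pos;
	if (n > (size_t)count)
		n = (size_t)count;
	memcpy(page + pos, text + pos, n);

	/* Signal the end of the file once the last byte has been written. */
	if (pos + n == total)
		*eof = 1;

	return (int)n;
}
```

Calling the function first with off 0 and count 4, then with off 4 and a large count, returns the content in two pieces and sets eof only on the second call, which is what pagers like more and less depend on.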
diff --git a/Documentation/DocBook/procfs_example.c b/Documentation/DocBook/procfs_example.c new file mode 100644 index 000000000000..7064084c1c5e --- /dev/null +++ b/Documentation/DocBook/procfs_example.c @@ -0,0 +1,224 @@ +/* + * procfs_example.c: an example proc interface + * + * Copyright (C) 2001, Erik Mouw (J.A.K.Mouw@its.tudelft.nl) + * + * This file accompanies the procfs-guide in the Linux kernel + * source. Its main use is to demonstrate the concepts and + * functions described in the guide. + * + * This software has been developed while working on the LART + * computing board (http://www.lart.tudelft.nl/), which is + * sponsored by the Mobile Multi-media Communications + * (http://www.mmc.tudelft.nl/) and Ubiquitous Communications + * (http://www.ubicom.tudelft.nl/) projects. + * + * The author can be reached at: + * + * Erik Mouw + * Information and Communication Theory Group + * Faculty of Information Technology and Systems + * Delft University of Technology + * P.O. Box 5031 + * 2600 GA Delft + * The Netherlands + * + * + * This program is free software; you can redistribute + * it and/or modify it under the terms of the GNU General + * Public License as published by the Free Software + * Foundation; either version 2 of the License, or (at your + * option) any later version. + * + * This program is distributed in the hope that it will be + * useful, but WITHOUT ANY WARRANTY; without even the implied + * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR + * PURPOSE. See the GNU General Public License for more + * details. 
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place,
+ * Suite 330, Boston, MA 02111-1307 USA
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/proc_fs.h>
+#include <linux/jiffies.h>
+#include <asm/uaccess.h>
+
+
+#define MODULE_VERS "1.0"
+#define MODULE_NAME "procfs_example"
+
+#define FOOBAR_LEN 8
+
+struct fb_data_t {
+	char name[FOOBAR_LEN + 1];
+	char value[FOOBAR_LEN + 1];
+};
+
+
+static struct proc_dir_entry *example_dir, *foo_file,
+	*bar_file, *jiffies_file, *symlink;
+
+
+struct fb_data_t foo_data, bar_data;
+
+
+static int proc_read_jiffies(char *page, char **start,
+			     off_t off, int count,
+			     int *eof, void *data)
+{
+	int len;
+
+	len = sprintf(page, "jiffies = %ld\n",
+		      jiffies);
+
+	return len;
+}
+
+
+static int proc_read_foobar(char *page, char **start,
+			    off_t off, int count,
+			    int *eof, void *data)
+{
+	int len;
+	struct fb_data_t *fb_data = (struct fb_data_t *)data;
+
+	/* DON'T DO THAT - buffer overruns are bad */
+	len = sprintf(page, "%s = '%s'\n",
+		      fb_data->name, fb_data->value);
+
+	return len;
+}
+
+
+static int proc_write_foobar(struct file *file,
+			     const char *buffer,
+			     unsigned long count,
+			     void *data)
+{
+	int len;
+	struct fb_data_t *fb_data = (struct fb_data_t *)data;
+
+	if(count > FOOBAR_LEN)
+		len = FOOBAR_LEN;
+	else
+		len = count;
+
+	if(copy_from_user(fb_data->value, buffer, len))
+		return -EFAULT;
+
+	fb_data->value[len] = '\0';
+
+	return len;
+}
+
+
+static int __init init_procfs_example(void)
+{
+	int rv = 0;
+
+	/* create directory */
+	example_dir = proc_mkdir(MODULE_NAME, NULL);
+	if(example_dir == NULL) {
+		rv = -ENOMEM;
+		goto out;
+	}
+
+	example_dir->owner = THIS_MODULE;
+
+	/* create jiffies using convenience function */
+	jiffies_file = create_proc_read_entry("jiffies",
+					      0444, example_dir,
+					      proc_read_jiffies,
+					      NULL);
+	if(jiffies_file == NULL) {
+		rv = -ENOMEM;
+		goto no_jiffies;
+	}
+
+	jiffies_file->owner = THIS_MODULE;
+
+	/*
 create foo and bar files using same callback
+	 * functions
+	 */
+	foo_file = create_proc_entry("foo", 0644, example_dir);
+	if(foo_file == NULL) {
+		rv = -ENOMEM;
+		goto no_foo;
+	}
+
+	strcpy(foo_data.name, "foo");
+	strcpy(foo_data.value, "foo");
+	foo_file->data = &foo_data;
+	foo_file->read_proc = proc_read_foobar;
+	foo_file->write_proc = proc_write_foobar;
+	foo_file->owner = THIS_MODULE;
+
+	bar_file = create_proc_entry("bar", 0644, example_dir);
+	if(bar_file == NULL) {
+		rv = -ENOMEM;
+		goto no_bar;
+	}
+
+	strcpy(bar_data.name, "bar");
+	strcpy(bar_data.value, "bar");
+	bar_file->data = &bar_data;
+	bar_file->read_proc = proc_read_foobar;
+	bar_file->write_proc = proc_write_foobar;
+	bar_file->owner = THIS_MODULE;
+
+	/* create symlink */
+	symlink = proc_symlink("jiffies_too", example_dir,
+			       "jiffies");
+	if(symlink == NULL) {
+		rv = -ENOMEM;
+		goto no_symlink;
+	}
+
+	symlink->owner = THIS_MODULE;
+
+	/* everything OK */
+	printk(KERN_INFO "%s %s initialised\n",
+	       MODULE_NAME, MODULE_VERS);
+	return 0;
+
+	/* error path: remove only the entries that were created above */
+no_symlink:
+	remove_proc_entry("bar", example_dir);
+no_bar:
+	remove_proc_entry("foo", example_dir);
+no_foo:
+	remove_proc_entry("jiffies", example_dir);
+no_jiffies:
+	remove_proc_entry(MODULE_NAME, NULL);
+out:
+	return rv;
+}
+
+
+static void __exit cleanup_procfs_example(void)
+{
+	remove_proc_entry("jiffies_too", example_dir);
+	remove_proc_entry("bar", example_dir);
+	remove_proc_entry("foo", example_dir);
+	remove_proc_entry("jiffies", example_dir);
+	remove_proc_entry(MODULE_NAME, NULL);
+
+	printk(KERN_INFO "%s %s removed\n",
+	       MODULE_NAME, MODULE_VERS);
+}
+
+
+module_init(init_procfs_example);
+module_exit(cleanup_procfs_example);
+
+MODULE_AUTHOR("Erik Mouw");
+MODULE_DESCRIPTION("procfs examples");
diff --git a/Documentation/DocBook/scsidrivers.tmpl b/Documentation/DocBook/scsidrivers.tmpl
new file mode 100644
index 000000000000..d058e65daf19
---
/dev/null +++ b/Documentation/DocBook/scsidrivers.tmpl @@ -0,0 +1,193 @@ + + + + + + SCSI Subsystem Interfaces + + + + Douglas + Gilbert + +
+ dgilbert@interlog.com +
+
+
+
+ 2003-08-11 + + + 2002 + 2003 + Douglas Gilbert + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + + +
+ + + + + Introduction + +This document outlines the interface between the Linux scsi mid level +and lower level drivers. Lower level drivers are variously called HBA +(host bus adapter) drivers, host drivers (HD) or pseudo adapter drivers. +The latter alludes to the fact that a lower level driver may be a +bridge to another IO subsystem (and the "ide-scsi" driver is an example +of this). There can be many lower level drivers active in a running +system, but only one per hardware type. For example, the aic7xxx driver +controls Adaptec controllers based on the 7xxx chip series. Most lower +level drivers can control one or more scsi hosts (a.k.a. scsi initiators). + + +This document can be found in an ASCII text file in the linux kernel +source: Documentation/scsi/scsi_mid_low_api.txt . +It currently holds a little more information than this document. The +drivers/scsi/hosts.h and +drivers/scsi/scsi.h headers contain descriptions of members +of important structures for the scsi subsystem. + + + + + Driver structure + +Traditionally a lower level driver for the scsi subsystem has been +at least two files in the drivers/scsi directory. For example, a +driver called "xyz" has a header file "xyz.h" and a source file +"xyz.c". [Actually there is no good reason why this couldn't all +be in one file.] Some drivers that have been ported to several operating +systems (e.g. aic7xxx which has separate files for generic and +OS-specific code) have more than two files. Such drivers tend to have +their own directory under the drivers/scsi directory. + + +scsi_module.c is normally included at the end of a lower +level driver. For it to work a declaration like this is needed before +it is included: + + static Scsi_Host_Template driver_template = DRIVER_TEMPLATE; + /* DRIVER_TEMPLATE should contain pointers to supported interface + functions.
Scsi_Host_Template is defined in hosts.h */ + #include "scsi_module.c" + + + +The scsi_module.c assumes the name "driver_template" is appropriately +defined. It contains 2 functions: + + + init_this_scsi_driver() called during builtin and module driver + initialization: invokes mid level's scsi_register_host() + + + exit_this_scsi_driver() called during closedown: invokes + mid level's scsi_unregister_host() + + + + +When a new, lower level driver is being added to Linux, the following +files (all found in the drivers/scsi directory) will need some attention: +Makefile, Config.help and Config.in . It is probably best to look at what +an existing lower level driver does in this regard. + + + + + Interface Functions +!EDocumentation/scsi/scsi_mid_low_api.txt + + + + Locks + +Each Scsi_Host instance has a spin_lock called Scsi_Host::default_lock +which is initialized in scsi_register() [found in hosts.c]. Within the +same function the Scsi_Host::host_lock pointer is initialized to point +at default_lock with the scsi_assign_lock() function. Thereafter +lock and unlock operations performed by the mid level use the +Scsi_Host::host_lock pointer. + + +Lower level drivers can override the use of Scsi_Host::default_lock by +using scsi_assign_lock(). The earliest opportunity to do this would +be in the detect() function after it has invoked scsi_register(). It +could be replaced by a coarser grain lock (e.g. per driver) or a +lock of equal granularity (i.e. per host). Using finer grain locks +(e.g. per scsi device) may be possible by juggling locks in +queuecommand(). + + + + + Changes since lk 2.4 series + +io_request_lock has been replaced by several finer grained locks. The lock +relevant to lower level drivers is Scsi_Host::host_lock and there is one +per scsi host. + + +The older error handling mechanism has been removed. This means the +lower level interface functions abort() and reset() have been removed.
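+The lock indirection described under Locks can be pictured with a small
+user-space sketch. It is not kernel code: struct toy_lock, struct toy_host
+and the toy_* functions are hypothetical names invented for illustration,
+and the toy lock only counts acquisitions instead of spinning. The sketch
+assumes only what the text states: host_lock starts out pointing at
+default_lock, and a driver may redirect it, for example to one lock shared
+by all of its hosts.

```c
#include <stddef.h>

/* Toy stand-in for a spinlock (hypothetical): it only records state. */
struct toy_lock {
	int held;
	unsigned long acquisitions;
};

/* Hypothetical miniature of Scsi_Host holding just the two lock members. */
struct toy_host {
	struct toy_lock default_lock;	/* models Scsi_Host::default_lock */
	struct toy_lock *host_lock;	/* models Scsi_Host::host_lock */
};

void toy_lock_init(struct toy_lock *lock)
{
	lock->held = 0;
	lock->acquisitions = 0;
}

/* Models the redirection the text attributes to scsi_assign_lock(). */
void toy_assign_lock(struct toy_host *host, struct toy_lock *lock)
{
	host->host_lock = lock;
}

/* Models the setup the text attributes to scsi_register(). */
void toy_host_init(struct toy_host *host)
{
	toy_lock_init(&host->default_lock);
	toy_assign_lock(host, &host->default_lock);
}

/* All locking goes through the pointer, never through default_lock. */
void toy_host_acquire(struct toy_host *host)
{
	host->host_lock->held = 1;
	host->host_lock->acquisitions++;
}

void toy_host_release(struct toy_host *host)
{
	host->host_lock->held = 0;
}
```

+A driver-wide (coarser grain) lock is modelled by passing the same toy_lock
+to toy_assign_lock() for every host. After that, acquisitions through any
+host are counted on the shared lock and each default_lock stays untouched.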
+ + +In the 2.4 series the scsi subsystem configuration descriptions were +aggregated with the configuration descriptions from all other Linux +subsystems in the Documentation/Configure.help file. In the 2.5 series, +the scsi subsystem now has its own (much smaller) drivers/scsi/Config.help +file. + + + + + Credits + +The following people have contributed to this document: + + +Mike Anderson andmike@us.ibm.com + + +James Bottomley James.Bottomley@steeleye.com + + +Patrick Mansfield patmans@us.ibm.com + + + + + +
diff --git a/Documentation/DocBook/sis900.tmpl b/Documentation/DocBook/sis900.tmpl new file mode 100644 index 000000000000..6c2cbac93c3f --- /dev/null +++ b/Documentation/DocBook/sis900.tmpl @@ -0,0 +1,585 @@ + + + + + + + +SiS 900/7016 Fast Ethernet Device Driver + + + +Ollie +Lho + + + +Lei Chun +Chang + + + +Document Revision: 0.3 for SiS900 driver v1.06 & v1.07 +November 16, 2000 + + + 1999 + Silicon Integrated System Corp. + + + + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + + + + + +This document gives some information on installation and usage of SiS 900/7016 +device driver under Linux. + + + + + + + + + Introduction + + +This document describes the revision 1.06 and 1.07 of SiS 900/7016 Fast Ethernet +device driver under Linux. The driver is developed by Silicon Integrated +System Corp. and distributed freely under the GNU General Public License (GPL). +The driver can be compiled as a loadable module and used under Linux kernel +version 2.2.x. (rev. 1.06) +With minimal changes, the driver can also be used under 2.3.x and 2.4.x kernel +(rev. 1.07), please see +. If you are intended to +use the driver for earlier kernels, you are on your own. + + + +The driver is tested with usual TCP/IP applications including +FTP, Telnet, Netscape etc. and is used constantly by the developers. 
+ + + +Please send all comments/fixes/questions to +Lei-Chun Chang. + + + + + Changes + + +Changes made in Revision 1.07 + + + + +Separation of sis900.c and sis900.h in order to move most +constant definitions to sis900.h (many of those constants were +corrected) + + + + + +Clean up PCI detection; the pci-scan code from Donald Becker is not used, +just simple pci_find_*. + + + + + +MII detection is modified to support multiple MII transceivers. + + + + + +Bugs in read_eeprom, mdio_* were removed. + + + + + +Many comments irrelevant to sis900 were removed/changed and +more comments were added to reflect the real situation. + + + + + +Clean up of physical/virtual address space mess in buffer +descriptors. + + + + + +Better transmit/receive error handling. + + + + + +The driver now uses zero-copy single buffer management +scheme to improve performance. + + + + + +Names of variables were changed to be more consistent. + + + + + +Clean up of auto-negotiation and timer code. + + + + + +Automatic detection and change of PHY on the fly. + + + + + +Bug in mac probing fixed. + + + + + +Fix 630E equalizer problem by modifying the equalizer workaround rule. + + + + + +Support for ICS1893 10/100 Integrated PHYceiver. + + + + + +Support for media select by ifconfig. + + + + + +Added kernel-doc extractable documentation.
+ + + + + + + + + Tested Environment + + +This driver is developed on the following hardware + + + + + +Intel Celeron 500 with SiS 630 (rev 02) chipset + + + + + +SiS 900 (rev 01) and SiS 7016/7014 Fast Ethernet Card + + + + + +and tested with these software environments + + + + + +Red Hat Linux version 6.2 + + + + + +Linux kernel version 2.4.0 + + + + + +Netscape version 4.6 + + + + + +NcFTP 3.0.0 beta 18 + + + + + +Samba version 2.0.3 + + + + + + + + + + +Files in This Package + + +In the package you can find these files: + + + + + + +sis900.c + + +Driver source file in C + + + + + +sis900.h + + +Header file for sis900.c + + + + + +sis900.sgml + + +DocBook SGML source of the document + + + + + +sis900.txt + + +Driver document in plain text + + + + + + + + + + Installation + + +Silicon Integrated System Corp. is cooperating closely with core Linux Kernel +developers. The revisions of the SiS 900 driver are distributed by the usual channels +for kernel tar files and patches. Those kernel tar files for official kernels and +patches for kernel pre-releases can be downloaded from the +official kernel ftp site +and its mirrors. +The 1.06 revision can be found in kernel versions later than 2.3.15 and pre-2.2.14, +and the 1.07 revision can be found in kernel version 2.4.0. +If you have no prior experience in networking under Linux, please read +Ethernet HOWTO and +Networking HOWTO available from +Linux Documentation Project (LDP). + + + +The driver is bundled in releases later than 2.2.11 and 2.3.15 so this +is the easiest case. +Be sure you have the appropriate packages for compiling kernel source. +Those packages are listed in Documentation/Changes in the kernel source +distribution. If you have to install a driver revision other than the one bundled +in the kernel release, you should have your driver files +sis900.c and sis900.h +copied into /usr/src/linux/drivers/net/ first.
+There are two alternative ways to install the driver + + + +Building the driver as loadable module + + +To build the driver as a loadable kernel module you have to reconfigure +the kernel to activate network support by + + + +make menuconfig + + + +Choose Loadable module support --->, +then select Enable loadable module support. + + + +Choose Network Device Support --->, select +Ethernet (10 or 100Mbit). +Then select EISA, VLB, PCI and on board controllers, +and set SiS 900/7016 PCI Fast Ethernet Adapter support +to M. + + + +After reconfiguring the kernel, you can make the driver module by + + + +make modules + + + +The driver should be compiled with no errors. After compiling the driver, +the driver can be installed to the proper place by + + + +make modules_install + + + +Load the driver into the kernel by + + + +insmod sis900 + + + +When loading the driver into memory, some informational messages can be viewed with + + + + +dmesg + + +or + + +cat /var/log/messages + + + + +If the driver is loaded properly you will have messages similar to this: + + + +sis900.c: v1.07.06 11/07/2000 +eth0: SiS 900 PCI Fast Ethernet at 0xd000, IRQ 10, 00:00:e8:83:7f:a4. +eth0: SiS 900 Internal MII PHY transceiver found at address 1. +eth0: Using SiS 900 Internal MII PHY as default + + + +showing the version of the driver and the results of the probing routine. + + + +Once the driver is loaded, the network can be brought up by + + + +/sbin/ifconfig eth0 IPADDR broadcast BROADCAST netmask NETMASK media TYPE + + + +where IPADDR, BROADCAST, NETMASK are your IP address, broadcast address and +netmask respectively. TYPE is used to set the medium type used by the device. +Typical values are "10baseT" (twisted-pair 10Mbps Ethernet) or "100baseT" +(twisted-pair 100Mbps Ethernet). For more information on how to configure +a network interface, please refer to +Networking HOWTO. + + + +The link status is also shown by kernel messages.
For example, after the +network interface is activated, you may have the message: + + + +eth0: Media Link On 100mbps full-duplex + + + +If you try to unplug the twisted pair (TP) cable you will get + + + +eth0: Media Link Off + + + +indicating that the link has failed. + + + + +Building the driver into kernel + + +If you want to build the driver into the kernel, choose Y +rather than M on +SiS 900/7016 PCI Fast Ethernet Adapter support +when configuring the kernel. Build the kernel image in the usual way + + + +make clean + +make bzlilo + + + +The next time the system reboots, the driver will be in memory. + + + + + + + Known Problems and Bugs + + +There are some known problems and bugs. If you find any other bugs please +mail them to lcchang@sis.com.tw + + + + + +AM79C901 HomePNA PHY is not thoroughly tested, so there may be some +bugs in the on the fly change of transceiver. + + + + + +A bug is hidden somewhere in the receive buffer management code; +the bug causes a NULL pointer reference in the kernel. This fault is +caught before bad things happen and reported with the message: + + +eth0: NULL pointer encountered in Rx ring, skipping + + +which can be viewed with dmesg or +cat /var/log/messages. + + + + + +The media type change from 10Mbps to 100Mbps twisted-pair ethernet +by ifconfig causes the media link to go down. + + + + + + + + + Revision History + + + + + + +November 13, 2000, Revision 1.07, seventh release, 630E problem fixed +and further clean up. + + + + + +November 4, 1999, Revision 1.06, Second release, lots of clean up +and optimization. + + + + + +August 8, 1999, Revision 1.05, Initial Public Release + + + + + + + + + Acknowledgements + + +This driver was originally derived from +Donald Becker's +pci-skeleton and +rtl8139 drivers. Donald also provided various suggestions +regarding the improvements made in revision 1.06. + + + +The 1.05 revision was created by +Jim Huang, AMD 79c901 +support was added by Chin-Shan Li.
+ + + + +List of Functions +!Idrivers/net/sis900.c + + + diff --git a/Documentation/DocBook/tulip-user.tmpl b/Documentation/DocBook/tulip-user.tmpl new file mode 100644 index 000000000000..6520d7a1b132 --- /dev/null +++ b/Documentation/DocBook/tulip-user.tmpl @@ -0,0 +1,327 @@ + + + + + + Tulip Driver User's Guide + + + + Jeff + Garzik + +
+ jgarzik@pobox.com +
+
+
+
+ + + 2001 + Jeff Garzik + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + +
+ + + + + Introduction + +The Tulip Ethernet Card Driver +is maintained by Jeff Garzik (jgarzik@pobox.com). + + + +The Tulip driver was developed by Donald Becker and changed by +Jeff Garzik, Takashi Manabe and a cast of thousands. + + + +For 2.4.x and later kernels, the Linux Tulip driver is available at +http://sourceforge.net/projects/tulip/ + + + + This driver is for the Digital "Tulip" Ethernet adapter interface. + It should work with most DEC 21*4*-based chips/ethercards, as well as + with work-alike chips from Lite-On (PNIC) and Macronix (MXIC) and ASIX. + + + + The original author may be reached as becker@scyld.com, or C/O + Scyld Computing Corporation, + 410 Severn Ave., Suite 210, + Annapolis MD 21403 + + + + Additional information on Donald Becker's tulip.c + is available at http://www.scyld.com/network/tulip.html + + + + + + Driver Compatibility + + +This device driver is designed for the DECchip "Tulip", Digital's +single-chip ethernet controllers for PCI (now owned by Intel). +Supported members of the family +are the 21040, 21041, 21140, 21140A, 21142, and 21143. Similar work-alike +chips from Lite-On, Macronix, ASIX, Compex and others listed below are also +supported. + + + +These chips are used on at least 140 unique PCI board designs. The great +number of chips and board designs supported is the reason for the +driver size and complexity. Almost all of the added complexity is in the +board configuration and media selection code. There is very little +increase in the operational critical path length. + + + + + Board-specific Settings + + +PCI bus devices are configured by the system at boot time, so no jumpers +need to be set on the board. The system BIOS preferably should assign the +PCI INTA signal to an otherwise unused system IRQ line. + + + +Some boards have EEPROM tables with a default media entry. The factory default +is usually "autoselect". This should only be overridden when using +transceiver connections without link beat e.g.
10base2 or AUI, or (rarely!) +for forcing full-duplex when used with old link partners that do not do +autonegotiation. + + + + + Driver Operation + +Ring buffers + + +The Tulip can use either ring buffers or lists of Tx and Rx descriptors. +This driver uses statically allocated rings of Rx and Tx descriptors, set at +compile time by RX/TX_RING_SIZE. This version of the driver allocates skbuffs +for the Rx ring buffers at open() time and passes the skb->data field to the +Tulip as receive data buffers. When an incoming frame is less than +RX_COPYBREAK bytes long, a fresh skbuff is allocated and the frame is +copied to the new skbuff. When the incoming frame is larger, the skbuff is +passed directly up the protocol stack and replaced by a newly allocated +skbuff. + + + +The RX_COPYBREAK value is chosen to trade-off the memory wasted by +using a full-sized skbuff for small frames vs. the copying costs of larger +frames. For small frames the copying cost is negligible (esp. considering +that we are pre-loading the cache with immediately useful header +information). For large frames the copying cost is non-trivial, and the +larger copy might flush the cache of useful data. A subtle aspect of this +choice is that the Tulip only receives into longword aligned buffers, thus +the IP header at offset 14 isn't longword aligned for further processing. +Copied frames are put into the new skbuff at an offset of "+2", thus copying +has the beneficial effect of aligning the IP header and preloading the +cache. + + + + +Synchronization + +The driver runs as two independent, single-threaded flows of control. One +is the send-packet routine, which enforces single-threaded use by the +dev->tbusy flag. The other thread is the interrupt handler, which is single +threaded by the hardware and other software. + + + +The send packet thread has partial control over the Tx ring and 'dev->tbusy' +flag. It sets the tbusy flag whenever it's queuing a Tx packet. 
If the next +queue slot is empty, it clears the tbusy flag when finished; otherwise it sets +the 'tp->tx_full' flag. + + + +The interrupt handler has exclusive control over the Rx ring and records stats +from the Tx ring. (The Tx-done interrupt can't be selectively turned off, so +we can't avoid the interrupt overhead by having the Tx routine reap the Tx +stats.) After reaping the stats, it marks the queue entry as empty by setting +the 'base' to zero. Iff the 'tp->tx_full' flag is set, it clears both the +tx_full and tbusy flags. + + + + + + + + Errata + + +The old DEC databooks were light on details. +The 21040 databook claims that CSR13, CSR14, and CSR15 should each be the last +register of the set CSR12-15 written. Hmmm, now how is that possible? + + + +The DEC SROM format is very badly designed and not precisely defined, leading to +part of the media selection junkheap below. Some boards do not have EEPROM +media tables and need to be patched up. Worse, other boards use the DEC +design kit media table when it isn't correct for their board. + + + +We cannot use MII interrupts because there is no defined GPIO pin to attach +them to. The MII transceiver status is polled using a kernel timer. + + + + + Driver Change History + + Version 0.9.14 (February 20, 2001) + + Fix PNIC problems (Manfred Spraul) + Add new PCI id for Accton comet + Support Davicom tulips + Fix oops in eeprom parsing + Enable workarounds for early PCI chipsets + IA64, hppa csr0 support + Support media types 5, 6 + Interpret a bit more of the 21142 SROM extended media type 3 + Add missing delay in eeprom reading + + + + Version 0.9.11 (November 3, 2000) + + Eliminate extra bus accesses when sharing interrupts (prumpf) + Barrier following ownership descriptor bit flip (prumpf) + Endianness fixes for >14 addresses in setup frames (prumpf) + Report link beat to kernel/userspace via netif_carrier_*. (kuznet) + Better spinlocking in set_rx_mode.
+ Fix I/O resource request failure error messages (DaveM catch) + Handle DMA allocation failure. + + + + Version 0.9.10 (September 6, 2000) + + Simple interrupt mitigation (via jamal) + More PCI ids + + + + Version 0.9.9 (August 11, 2000) + + More PCI ids + + + + Version 0.9.8 (July 13, 2000) + + Correct signed/unsigned comparison for dummy frame index + Remove outdated references to struct enet_statistics + + + + Version 0.9.7 (June 17, 2000) + + Timer cleanups (Andrew Morton) + Alpha compile fix (somebody?) + + + + Version 0.9.6 (May 31, 2000) + + Revert 21143-related support flag patch + Add HPPA/media-table debugging printk + + + + Version 0.9.5 (May 30, 2000) + + HPPA support (willy@puffingroup) + CSR6 bits and tulip.h cleanup (Chris Smith) + Improve debugging messages a bit + Add delay after CSR13 write in t21142_start_nway + Remove unused ETHER_STATS code + Convert 'extern inline' to 'static inline' in tulip.h (Chris Smith) + Update DS21143 support flags in tulip_chip_info[] + Use spin_lock_irq, not _irqsave/restore, in tulip_start_xmit() + Add locking to set_rx_mode() + Fix race with chip setting DescOwned bit (Hal Murray) + Request 100% of PIO and MMIO resource space assigned to card + Remove error message from pci_enable_device failure + + + + Version 0.9.4.3 (April 14, 2000) + + mod_timer fix (Hal Murray) + PNIC2 resuscitation (Chris Smith) + + + + Version 0.9.4.2 (March 21, 2000) + + Fix 21041 CSR7, CSR13/14/15 handling + Merge some PCI ids from tulip 0.91x + Merge some HAS_xxx flags and flag settings from tulip 0.91x + asm/io.h fix (submitted by many) and cleanup + s/HAS_NWAY143/HAS_NWAY/ + Cleanup 21041 mode reporting + Small code cleanups + + + + Version 0.9.4.1 (March 18, 2000) + + Finish PCI DMA conversion (davem) + Do not netif_start_queue() at end of tulip_tx_timeout() (kuznet) + PCI DMA fix (kuznet) + eeprom.c code cleanup + Remove Xircom Tulip crud + + + + +
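+The RX_COPYBREAK policy described under Ring buffers can be sketched in
+user space as follows. This is an illustration under stated assumptions,
+not the driver's code: the toy_* names and the threshold value are
+hypothetical, and plain byte arrays stand in for skbuffs. It shows the two
+decisions from the text: frames shorter than the threshold are copied, and
+a copied frame is placed at offset +2 so that the IP header (14 bytes into
+the frame) lands on a longword boundary.

```c
#include <stddef.h>
#include <string.h>

#define TOY_RX_COPYBREAK 100	/* hypothetical threshold, in bytes */
#define TOY_ETH_HLEN	 14	/* Ethernet header length */
#define TOY_ALIGN_PAD	 2	/* the "+2" offset from the text */

enum toy_rx_action {
	TOY_RX_COPY,	/* small frame: copy into a fresh buffer */
	TOY_RX_PASS_UP	/* large frame: hand the ring buffer up, refill ring */
};

/* Decide how to handle a received frame of frame_len bytes. */
enum toy_rx_action toy_rx_decide(size_t frame_len)
{
	return frame_len < TOY_RX_COPYBREAK ? TOY_RX_COPY : TOY_RX_PASS_UP;
}

/*
 * Copy a frame into dst at offset +2. Returns the offset of the IP header
 * inside dst (16, a multiple of 4, assuming dst itself is longword
 * aligned), or (size_t)-1 if dst is too small.
 */
size_t toy_rx_copy(unsigned char *dst, size_t dst_len,
		   const unsigned char *frame, size_t frame_len)
{
	if (dst_len < frame_len + TOY_ALIGN_PAD)
		return (size_t)-1;
	memcpy(dst + TOY_ALIGN_PAD, frame, frame_len);
	return TOY_ALIGN_PAD + TOY_ETH_HLEN;
}
```

+A receive path built on this sketch would call toy_rx_decide() first and
+only call toy_rx_copy() for the TOY_RX_COPY case; the large-frame case
+involves no copy at all, which is the trade-off the text describes.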
diff --git a/Documentation/DocBook/usb.tmpl b/Documentation/DocBook/usb.tmpl new file mode 100644 index 000000000000..f3ef0bf435e9 --- /dev/null +++ b/Documentation/DocBook/usb.tmpl @@ -0,0 +1,979 @@ + + + + + + The Linux-USB Host Side API + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + + + + + + + Introduction to USB on Linux + + A Universal Serial Bus (USB) is used to connect a host, + such as a PC or workstation, to a number of peripheral + devices. USB uses a tree structure, with the host at the + root (the system's master), hubs as interior nodes, and + peripheral devices as leaves (and slaves). + Modern PCs support several such trees of USB devices, usually + one USB 2.0 tree (480 Mbit/sec each) with + a few USB 1.1 trees (12 Mbit/sec each) that are used when you + connect a USB 1.1 device directly to the machine's "root hub". + + + That master/slave asymmetry was designed in part for + ease of use. It is not physically possible to assemble + (legal) USB cables incorrectly: all upstream "to-the-host" + connectors are the rectangular type, matching the sockets on + root hubs, and the downstream type are the squarish type + (or they are built in to the peripheral). 
+ Software doesn't need to deal with distributed autoconfiguration + since the pre-designated master node manages all that. + At the electrical level, bus protocol overhead is reduced by + eliminating arbitration and moving scheduling into host software. + + + USB 1.0 was announced in January 1996, and was revised + as USB 1.1 (with improvements in hub specification and + support for interrupt-out transfers) in September 1998. + USB 2.0 was released in April 2000, including high speed + transfers and transaction translating hubs (used for USB 1.1 + and 1.0 backward compatibility). + + + USB support was added to Linux early in the 2.2 kernel series + shortly before the 2.3 development forked off. Updates + from 2.3 were regularly folded back into 2.2 releases, bringing + new features such as /sbin/hotplug support, + more drivers, and more robustness. + The 2.5 kernel series continued such improvements, and also + worked on USB 2.0 support, + higher performance, + better consistency between host controller drivers, + API simplification (to make bugs less likely), + and providing internal "kerneldoc" documentation. + + + Linux can run inside USB devices as well as on + the hosts that control the devices. + Because the Linux 2.x USB support evolved to support mass market + platforms such as Apple Macintosh or PC-compatible systems, + it didn't address design concerns for those types of USB systems. + So it can't be used inside mass-market PDAs, or other peripherals. + USB device drivers running inside those Linux peripherals + don't do the same things as the ones running inside hosts, + and so they've been given a different name: + they're called gadget drivers. + This document does not present gadget drivers. + + + + + + USB Host-Side API Model + + Within the kernel, + host-side drivers for USB devices talk to the "usbcore" APIs. + There are two types of public "usbcore" APIs, targetted at two different + layers of USB driver. 
Those are + general purpose drivers, exposed through + driver frameworks such as block, character, or network devices; + and drivers that are part of the core, + which are involved in managing a USB bus. + Such core drivers include the hub driver, + which manages trees of USB devices, and several different kinds + of host controller driver (HCD), + which control individual busses. + + + The device model seen by USB drivers is relatively complex. + + + + + USB supports four kinds of data transfer + (control, bulk, interrupt, and isochronous). Two transfer + types use bandwidth as it's available (control and bulk), + while the other two types of transfer (interrupt and isochronous) + are scheduled to provide guaranteed bandwidth. + + + The device description model includes one or more + "configurations" per device, only one of which is active at a time. + Devices that are capable of high speed operation must also support + full speed configurations, along with a way to ask about the + "other speed" configurations that might be used. + + + Configurations have one or more "interface", each + of which may have "alternate settings". Interfaces may be + standardized by USB "Class" specifications, or may be specific to + a vendor or device. + + USB device drivers actually bind to interfaces, not devices. + Think of them as "interface drivers", though you + may not see many devices where the distinction is important. + Most USB devices are simple, with only one configuration, + one interface, and one alternate setting. + + + Interfaces have one or more "endpoints", each of + which supports one type and direction of data transfer such as + "bulk out" or "interrupt in". The entire configuration may have + up to sixteen endpoints in each direction, allocated as needed + among all the interfaces. + + + Data transfer on USB is packetized; each endpoint + has a maximum packet size. 
+ Drivers must often be aware of conventions such as flagging the end + of bulk transfers using "short" (including zero length) packets. + + + The Linux USB API supports synchronous calls for + control and bulk messaging. + It also supports asynchronous calls for all kinds of data transfer, + using request structures called "URBs" (USB Request Blocks). + + + + + Accordingly, the USB Core API exposed to device drivers + covers quite a lot of territory. You'll probably need to consult + the USB 2.0 specification, available online from www.usb.org at + no cost, as well as class or device specifications. + + + The only host-side drivers that actually touch hardware + (reading/writing registers, handling IRQs, and so on) are the HCDs. + In theory, all HCDs provide the same functionality through the same + API. In practice, that's becoming more true on the 2.5 kernels, + but there are still differences that crop up especially with + fault handling. Different controllers don't necessarily report + the same aspects of failures, and recovery from faults (including + software-induced ones like unlinking an URB) isn't yet fully + consistent. + Device driver authors should make a point of doing disconnect + testing (while the device is active) with each different host + controller driver, to make sure drivers don't have bugs of + their own as well as to make sure they aren't relying on some + HCD-specific behavior. + (You will need external USB 1.1 and/or + USB 2.0 hubs to perform all those tests.) + + + + +USB-Standard Types + + In <linux/usb_ch9.h> you will find + the USB data types defined in chapter 9 of the USB specification. + These data types are used throughout USB, and in APIs including + this host side API, gadget APIs, and usbfs. + + +!Iinclude/linux/usb_ch9.h + + + +Host-Side Data Types and Macros + + The host side API exposes several layers to drivers, some of + which are more necessary than others.
+ These support lifecycle models for host side drivers + and devices, and support passing buffers through usbcore to + some HCD that performs the I/O for the device driver. + + + +!Iinclude/linux/usb.h + + + + USB Core APIs + + There are two basic I/O models in the USB API. + The most elemental one is asynchronous: drivers submit requests + in the form of an URB, and the URB's completion callback + handles the next step. + All USB transfer types support that model, although there + are special cases for control URBs (which always have setup + and status stages, but may not have a data stage) and + isochronous URBs (which allow large packets and include + per-packet fault reports). + Built on top of that is synchronous API support, where a + driver calls a routine that allocates one or more URBs, + submits them, and waits until they complete. + There are synchronous wrappers for single-buffer control + and bulk transfers (which are awkward to use in some + driver disconnect scenarios), and for scatterlist based + streaming i/o (bulk or interrupt). + + + USB drivers need to provide buffers that can be + used for DMA, although they don't necessarily need to + provide the DMA mapping themselves. + There are APIs to use when allocating DMA buffers, + which can prevent use of bounce buffers on some systems. + In some cases, drivers may be able to rely on 64bit DMA + to eliminate another kind of bounce buffer. + + +!Edrivers/usb/core/urb.c +!Edrivers/usb/core/message.c +!Edrivers/usb/core/file.c +!Edrivers/usb/core/usb.c +!Edrivers/usb/core/hub.c + + + Host Controller APIs + + These APIs are only for use by host controller drivers, + most of which implement standard register interfaces such as + EHCI, OHCI, or UHCI. + UHCI was one of the first interfaces, designed by Intel and + also used by VIA; it doesn't do much in hardware. + OHCI was designed later, to have the hardware do more work + (bigger transfers, tracking protocol state, and so on).
+ EHCI was designed with USB 2.0; its design has features that + resemble OHCI (hardware does much more work) as well as + UHCI (some parts of ISO support, TD list processing). + + + There are host controllers other than the "big three", + although most PCI based controllers (and a few non-PCI based + ones) use one of those interfaces. + Not all host controllers use DMA; some use PIO, and there + is also a simulator. + + + The same basic APIs are available to drivers for all + those controllers. + For historical reasons they are in two layers: + struct usb_bus is a rather thin + layer that became available in the 2.2 kernels, while + struct usb_hcd is a more featureful + layer (available in later 2.4 kernels and in 2.5) that + lets HCDs share common code, to shrink driver size + and significantly reduce hcd-specific behaviors. + + +!Edrivers/usb/core/hcd.c +!Edrivers/usb/core/hcd-pci.c +!Edrivers/usb/core/buffer.c + + + + The USB Filesystem (usbfs) + + This chapter presents the Linux usbfs. + You may prefer to avoid writing new kernel code for your + USB driver; that's the problem that usbfs set out to solve. + User mode device drivers are usually packaged as applications + or libraries, and may use usbfs through some programming library + that wraps it. Such libraries include + libusb + for C/C++, and + jUSB for Java. + + + Unfinished + This particular documentation is incomplete, + especially with respect to the asynchronous mode. + As of kernel 2.5.66 the code and this (new) documentation + need to be cross-reviewed. + + + + Configure usbfs into Linux kernels by enabling the + USB filesystem option (CONFIG_USB_DEVICEFS), + and you get basic support for user mode USB device drivers. + Until relatively recently it was often (confusingly) called + usbdevfs although it wasn't solving what + devfs was. + Every USB device will appear in usbfs, regardless of whether or + not it has a kernel driver; but only devices with kernel drivers + show up in devfs. 
+ + + + What files are in "usbfs"? + + Conventionally mounted at + /proc/bus/usb, usbfs + features include: + + /proc/bus/usb/devices + ... a text file + showing each of the USB devices known to the kernel, + and their configuration descriptors. + You can also poll() this to learn about new devices. + + /proc/bus/usb/BBB/DDD + ... magic files + exposing each device's configuration descriptors, and + supporting a series of ioctls for making device requests, + including I/O to devices. (Purely for access by programs.) + + + + + Each bus is given a number (BBB) based on when it was + enumerated; within each bus, each device is given a similar + number (DDD). + Those BBB/DDD paths are not "stable" identifiers; + expect them to change even if you always leave the devices + plugged in to the same hub port. + Don't even think of saving these in application + configuration files. + Stable identifiers are available, for user mode applications + that want to use them. HID and networking devices expose + these stable IDs, so that for example you can be sure that + you told the right UPS to power down its second server. + "usbfs" doesn't (yet) expose those IDs. + + + + + + Mounting and Access Control + + There are a number of mount options for usbfs, which will + be of most interest to you if you need to override the default + access control policy. + That policy is that only root may read or write device files + (/proc/bus/usb/BBB/DDD) although anyone may read + the devices + or drivers files. + I/O requests to the device also need the CAP_SYS_RAWIO capability. + + + The significance of that is that by default, all user mode + device drivers need super-user privileges. + You can change modes or ownership in a driver setup + when the device hotplugs, or maybe just start the + driver right then, as a privileged server (or some activity + within one).
+ That's the most secure approach for multi-user systems, + but for single user systems ("trusted" by that user) + it's more convenient just to grant everyone all access + (using the devmode=0666 option) + so the driver can start whenever it's needed. + + + The mount options for usbfs, usable in /etc/fstab or + in command line invocations of mount, are: + + + + busgid=NNNNN + Controls the GID used for the + /proc/bus/usb/BBB + directories. (Default: 0) + busmode=MMM + Controls the file mode used for the + /proc/bus/usb/BBB + directories. (Default: 0555) + + busuid=NNNNN + Controls the UID used for the + /proc/bus/usb/BBB + directories. (Default: 0) + + devgid=NNNNN + Controls the GID used for the + /proc/bus/usb/BBB/DDD + files. (Default: 0) + devmode=MMM + Controls the file mode used for the + /proc/bus/usb/BBB/DDD + files. (Default: 0644) + devuid=NNNNN + Controls the UID used for the + /proc/bus/usb/BBB/DDD + files. (Default: 0) + + listgid=NNNNN + Controls the GID used for the + /proc/bus/usb/devices and drivers files. + (Default: 0) + listmode=MMM + Controls the file mode used for the + /proc/bus/usb/devices and drivers files. + (Default: 0444) + listuid=NNNNN + Controls the UID used for the + /proc/bus/usb/devices and drivers files. + (Default: 0) + + + + + Note that many Linux distributions hard-wire the mount options + for usbfs in their init scripts, such as + /etc/rc.d/rc.sysinit, + rather than making it easy to set this per-system + policy in /etc/fstab. + + + + + + /proc/bus/usb/devices + + This file is handy for status viewing tools in user + mode, which can scan the text format and ignore most of it. + More detailed device status (including class and vendor + status) is available from device-specific files. + For information about the current format of this file, + see the + Documentation/usb/proc_usb_info.txt + file in your Linux kernel sources. 
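As a rough illustration of a status tool that scans the text format and ignores most of it, the sketch below picks out only the vendor and product IDs. It assumes the "P:" line layout described in Documentation/usb/proc_usb_info.txt ("P:  Vendor=xxxx ProdID=xxxx Rev=xx.xx"); check that file for the format your kernel actually emits. The helper names parse_product_line and count_matching are hypothetical, not part of any kernel or library API.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 and fills in the IDs if 'line' is a
 * "P:" line in the assumed format, 0 for any other line. */
static int parse_product_line(const char *line,
                              unsigned *vendor, unsigned *product)
{
        return sscanf(line, "P: Vendor=%x ProdID=%x", vendor, product) == 2;
}

/* Hypothetical helper: counts the devices with the given IDs listed in
 * an already opened devices file, skipping every line that is not a
 * "P:" line. */
static int count_matching(FILE *f, unsigned vendor, unsigned product)
{
        char line[256];
        unsigned v, p;
        int count = 0;

        while (fgets(line, sizeof line, f))
                if (parse_product_line(line, &v, &p) &&
                    v == vendor && p == product)
                        count++;
        return count;
}
```

A caller would typically fopen() the devices file under the usbfs mount point and pass the stream to count_matching(); a result greater than one means the tool has to let the user choose a device.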
+ + + Otherwise the main use for this file from programs + is to poll() it to get notifications of usb devices + as they're plugged or unplugged. + To see what changed, you'd need to read the file and + compare "before" and "after" contents, scan the filesystem, + or see its hotplug event. + + + + + + /proc/bus/usb/BBB/DDD + + Use these files in one of these basic ways: + + + They can be read, + producing first the device descriptor + (18 bytes) and then the descriptors for the current configuration. + See the USB 2.0 spec for details about those binary data formats. + You'll need to convert most multibyte values from little endian + format to your native host byte order, although a few of the + fields in the device descriptor (both of the BCD-encoded fields, + and the vendor and product IDs) will be byteswapped for you. + Note that configuration descriptors include descriptors for + interfaces, altsettings, endpoints, and maybe additional + class descriptors. + + + Perform USB operations using + ioctl() requests to make endpoint I/O + requests (synchronously or asynchronously) or manage + the device. + These requests need the CAP_SYS_RAWIO capability, + as well as filesystem access permissions. + Only one ioctl request can be made on one of these + device files at a time. + This means that if you are synchronously reading an endpoint + from one thread, you won't be able to write to a different + endpoint from another thread until the read completes. + This works for half duplex protocols, + but otherwise you'd use asynchronous i/o requests. + + + + + + + Life Cycle of User Mode Drivers + + Such a driver first needs to find a device file + for a device it knows how to handle. + Maybe it was told about it because a + /sbin/hotplug event handling agent + chose that driver to handle the new device. + Or maybe it's an application that scans all the + /proc/bus/usb device files, and ignores most devices. 
+ In either case, it should read() all + the descriptors from the device file, + and check them against what it knows how to handle. + It might just reject everything except a particular + vendor and product ID, or need a more complex policy. + + + Never assume there will only be one such device + on the system at a time! + If your code can't handle more than one device at + a time, at least detect when there's more than one, and + have your users choose which device to use. + + + Once your user mode driver knows what device to use, + it interacts with it in either of two styles. + The simple style is to make only control requests; some + devices don't need more complex interactions than those. + (An example might be software using vendor-specific control + requests for some initialization or configuration tasks, + with a kernel driver for the rest.) + + + More likely, you need a more complex style driver: + one using non-control endpoints, reading or writing data + and claiming exclusive use of an interface. + Bulk transfers are easiest to use, + but only their sibling interrupt transfers + work with low speed devices. + Both interrupt and isochronous transfers + offer service guarantees because their bandwidth is reserved. + Such "periodic" transfers are awkward to use through usbfs, + unless you're using the asynchronous calls. However, interrupt + transfers can also be used in a synchronous "one shot" style. + + + Your user-mode driver should never need to worry + about cleaning up request state when the device is + disconnected, although it should close its open file + descriptors as soon as it starts seeing the ENODEV + errors. 
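The first steps of that life cycle (read the descriptors, then check them against what the driver handles) might be sketched as follows. This is a minimal sketch under stated assumptions, not a reference implementation: it reads only the 18-byte device descriptor, it relies on the statement above that usbfs returns the vendor and product IDs already in host byte order, and the names MY_DEVDESC_LEN, read_device_descriptor and descriptor_matches are hypothetical.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define MY_DEVDESC_LEN 18       /* size of a USB device descriptor */

/* Read the device descriptor that starts every usbfs device file.
 * Returns 0 on success, -1 if the file can't be opened or is too short. */
static int read_device_descriptor(const char *path,
                                  unsigned char desc[MY_DEVDESC_LEN])
{
        int fd = open(path, O_RDONLY);
        ssize_t n;

        if (fd < 0)
                return -1;
        n = read(fd, desc, MY_DEVDESC_LEN);
        close(fd);
        return n == MY_DEVDESC_LEN ? 0 : -1;
}

/* Returns 1 if 'desc' is a device descriptor with the given IDs.
 * idVendor sits at offset 8 and idProduct at offset 10; both are
 * assumed to be in host byte order, as usbfs delivers them. */
static int descriptor_matches(const unsigned char desc[MY_DEVDESC_LEN],
                              uint16_t vendor, uint16_t product)
{
        uint16_t v, p;

        /* bLength must be 18 and bDescriptorType must be 1 (device) */
        if (desc[0] != MY_DEVDESC_LEN || desc[1] != 1)
                return 0;
        memcpy(&v, desc + 8, sizeof v);
        memcpy(&p, desc + 10, sizeof p);
        return v == vendor && p == product;
}
```

A driver that scans the usbfs tree would call read_device_descriptor() on each BBB/DDD file, keep every path for which descriptor_matches() returns 1, and ask the user to choose if more than one path survives.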
+ + + + + The ioctl() Requests + + To use these ioctls, you need to include the following + headers in your userspace program: +#include <linux/usb.h> +#include <linux/usbdevice_fs.h> +#include <asm/byteorder.h> + The standard USB device model requests, from "Chapter 9" of + the USB 2.0 specification, are automatically included from + the <linux/usb_ch9.h> header. + + + Unless noted otherwise, the ioctl requests + described here will + update the modification time on the usbfs file to which + they are applied (unless they fail). + A return of zero indicates success; otherwise, a + standard USB error code is returned. (These are + documented in + Documentation/usb/error-codes.txt + in your kernel sources.) + + + Each of these files multiplexes access to several + I/O streams, one per endpoint. + Each device has one control endpoint (endpoint zero) + which supports a limited RPC style access. + Devices are configured + by khubd (in the kernel) setting a device-wide + configuration that affects things + like power consumption and basic functionality. + The endpoints are part of USB interfaces, + which may have altsettings + affecting things like which endpoints are available. + Many devices only have a single configuration and interface, + so drivers for them will ignore configurations and altsettings. + + + + + Management/Status Requests + + A number of usbfs requests don't deal very directly + with device I/O. + They mostly relate to device management and status. + These are all synchronous requests. + + + + + USBDEVFS_CLAIMINTERFACE + This is used to force usbfs to + claim a specific interface, + which has not previously been claimed by usbfs or any other + kernel driver. + The ioctl parameter is an integer holding the number of + the interface (bInterfaceNumber from descriptor).
+ + Note that if your driver doesn't claim an interface + before trying to use one of its endpoints, and no + other driver has bound to it, then the interface is + automatically claimed by usbfs. + + This claim will be released by a RELEASEINTERFACE ioctl, + or by closing the file descriptor. + File modification time is not updated by this request. + + + USBDEVFS_CONNECTINFO + Says whether the device is lowspeed. + The ioctl parameter points to a structure like this: +struct usbdevfs_connectinfo { + unsigned int devnum; + unsigned char slow; +}; + File modification time is not updated by this request. + + You can't tell whether a "not slow" + device is connected at high speed (480 MBit/sec) + or just full speed (12 MBit/sec). + You should know the devnum value already, + it's the DDD value of the device file name. + + + USBDEVFS_GETDRIVER + Returns the name of the kernel driver + bound to a given interface (a string). Parameter + is a pointer to this structure, which is modified: +struct usbdevfs_getdriver { + unsigned int interface; + char driver[USBDEVFS_MAXDRIVERNAME + 1]; +}; + File modification time is not updated by this request. + + + USBDEVFS_IOCTL + Passes a request from userspace through + to a kernel driver that has an ioctl entry in the + struct usb_driver it registered. +struct usbdevfs_ioctl { + int ifno; + int ioctl_code; + void *data; +}; + +/* user mode call looks like this. + * 'request' becomes the driver->ioctl() 'code' parameter. + * the size of 'param' is encoded in 'request', and that data + * is copied to or from the driver->ioctl() 'buf' parameter. + */ +static int +usbdev_ioctl (int fd, int ifno, unsigned request, void *param) +{ + struct usbdevfs_ioctl wrapper; + + wrapper.ifno = ifno; + wrapper.ioctl_code = request; + wrapper.data = param; + + return ioctl (fd, USBDEVFS_IOCTL, &wrapper); +} + File modification time is not updated by this request. 
+ + This request lets kernel drivers talk to user mode code + through filesystem operations even when they don't create + a character or block special device. + It's also been used to do things like ask devices what + device special file should be used. + Two pre-defined ioctls are used + to disconnect and reconnect kernel drivers, so + that user mode code can completely manage binding + and configuration of devices. + + + USBDEVFS_RELEASEINTERFACE + This is used to release the claim usbfs + made on interface, either implicitly or because of a + USBDEVFS_CLAIMINTERFACE call, before the file + descriptor is closed. + The ioctl parameter is an integer holding the number of + the interface (bInterfaceNumber from descriptor); + File modification time is not updated by this request. + + No security check is made to ensure + that the task which made the claim is the one + which is releasing it. + This means that one user mode driver may interfere + with other ones. + + + USBDEVFS_RESETEP + Resets the data toggle value for an endpoint + (bulk or interrupt) to DATA0. + The ioctl parameter is an integer endpoint number + (1 to 15, as identified in the endpoint descriptor), + with USB_DIR_IN added if the device's endpoint sends + data to the host. + + Avoid using this request. + It should probably be removed. + Using it typically means the device and driver will lose + toggle synchronization. If you really lost synchronization, + you likely need to completely handshake with the device, + using a request like CLEAR_HALT + or SET_INTERFACE. + + + + + + + + Synchronous I/O Support + + Synchronous requests involve the kernel blocking + until the user mode request completes, either by + finishing successfully or by reporting an error. + In most cases this is the simplest way to use usbfs, + although as noted above it does prevent performing I/O + to more than one endpoint at a time. + + + + + USBDEVFS_BULK + Issues a bulk read or write request to the + device.
+ The ioctl parameter is a pointer to this structure: +struct usbdevfs_bulktransfer { + unsigned int ep; + unsigned int len; + unsigned int timeout; /* in milliseconds */ + void *data; +}; + The "ep" value identifies a + bulk endpoint number (1 to 15, as identified in an endpoint + descriptor), + masked with USB_DIR_IN when referring to an endpoint which + sends data to the host from the device. + The length of the data buffer is identified by "len"; + Recent kernels support requests up to about 128KBytes. + FIXME say how read length is returned, + and how short reads are handled.. + + + USBDEVFS_CLEAR_HALT + Clears endpoint halt (stall) and + resets the endpoint toggle. This is only + meaningful for bulk or interrupt endpoints. + The ioctl parameter is an integer endpoint number + (1 to 15, as identified in an endpoint descriptor), + masked with USB_DIR_IN when referring to an endpoint which + sends data to the host from the device. + + Use this on bulk or interrupt endpoints which have + stalled, returning -EPIPE status + to a data transfer request. + Do not issue the control request directly, since + that could invalidate the host's record of the + data toggle. + + + USBDEVFS_CONTROL + Issues a control request to the device. + The ioctl parameter points to a structure like this: +struct usbdevfs_ctrltransfer { + __u8 bRequestType; + __u8 bRequest; + __u16 wValue; + __u16 wIndex; + __u16 wLength; + __u32 timeout; /* in milliseconds */ + void *data; +}; + + The first eight bytes of this structure are the contents + of the SETUP packet to be sent to the device; see the + USB 2.0 specification for details. + The bRequestType value is composed by combining a + USB_TYPE_* value, a USB_DIR_* value, and a + USB_RECIP_* value (from + <linux/usb.h>). + If wLength is nonzero, it describes the length of the data + buffer, which is either written to the device + (USB_DIR_OUT) or read from the device (USB_DIR_IN). 
+ + At this writing, you can't transfer more than 4 KBytes + of data to or from a device; usbfs has a limit, and + some host controller drivers have a limit. + (That's not usually a problem.) + Also there's no way to say it's + not OK to get a short read back from the device. + + + USBDEVFS_RESET + Does a USB level device reset. + The ioctl parameter is ignored. + After the reset, this rebinds all device interfaces. + File modification time is not updated by this request. + + Avoid using this call + until some usbcore bugs get fixed, + since it does not fully synchronize device, interface, + and driver (not just usbfs) state. + + + USBDEVFS_SETINTERFACE + Sets the alternate setting for an + interface. The ioctl parameter is a pointer to a + structure like this: +struct usbdevfs_setinterface { + unsigned int interface; + unsigned int altsetting; +}; + File modification time is not updated by this request. + + Those struct members are from some interface descriptor + applying to the current configuration. + The interface number is the bInterfaceNumber value, and + the altsetting number is the bAlternateSetting value. + (This resets each endpoint in the interface.) + + + USBDEVFS_SETCONFIGURATION + Issues the + usb_set_configuration call + for the device. + The parameter is an integer holding the number of + a configuration (bConfigurationValue from descriptor). + File modification time is not updated by this request. + + Avoid using this call + until some usbcore bugs get fixed, + since it does not fully synchronize device, interface, + and driver (not just usbfs) state. + + + + + + + Asynchronous I/O Support + + As mentioned above, there are situations where it may be + important to initiate concurrent operations from user mode code. + This is particularly important for periodic transfers + (interrupt and isochronous), but it can be used for other + kinds of USB requests too. + In such cases, the asynchronous requests described here + are essential.
Rather than submitting one request and having + the kernel block until it completes, the blocking is separate. + + + These requests are packaged into a structure that + resembles the URB used by kernel device drivers. + (No POSIX Async I/O support here, sorry.) + It identifies the endpoint type (USBDEVFS_URB_TYPE_*), + endpoint (number, masked with USB_DIR_IN as appropriate), + buffer and length, and a user "context" value serving to + uniquely identify each request. + (It's usually a pointer to per-request data.) + Flags can modify requests (not as many as supported for + kernel drivers). + + + Each request can specify a realtime signal number + (between SIGRTMIN and SIGRTMAX, inclusive) to request a + signal be sent when the request completes. + + + When usbfs returns these urbs, the status value + is updated, and the buffer may have been modified. + Except for isochronous transfers, the actual_length is + updated to say how many bytes were transferred; if the + USBDEVFS_URB_DISABLE_SPD flag is set + ("short packets are not OK"), if fewer bytes were read + than were requested then you get an error report. + + +struct usbdevfs_iso_packet_desc { + unsigned int length; + unsigned int actual_length; + unsigned int status; +}; + +struct usbdevfs_urb { + unsigned char type; + unsigned char endpoint; + int status; + unsigned int flags; + void *buffer; + int buffer_length; + int actual_length; + int start_frame; + int number_of_packets; + int error_count; + unsigned int signr; + void *usercontext; + struct usbdevfs_iso_packet_desc iso_frame_desc[]; +}; + + For these asynchronous requests, the file modification + time reflects when the request was initiated. + This contrasts with their use with the synchronous requests, + where it reflects when requests complete. + + + + + USBDEVFS_DISCARDURB + + TBS + File modification time is not updated by this request. + + + + USBDEVFS_DISCSIGNAL + + TBS + File modification time is not updated by this request. 
+ + + + USBDEVFS_REAPURB + + TBS + File modification time is not updated by this request. + + + + USBDEVFS_REAPURBNDELAY + + TBS + File modification time is not updated by this request. + + + + USBDEVFS_SUBMITURB + + TBS + + + + + + + + + + + + diff --git a/Documentation/DocBook/via-audio.tmpl b/Documentation/DocBook/via-audio.tmpl new file mode 100644 index 000000000000..36e642147d6b --- /dev/null +++ b/Documentation/DocBook/via-audio.tmpl @@ -0,0 +1,597 @@ + + + + + + Via 686 Audio Driver for Linux + + + + Jeff + Garzik + + + + + 1999-2001 + Jeff Garzik + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + + + + + + + Introduction + + The Via VT82C686A "super southbridge" chips contain + AC97-compatible audio logic which features dual 16-bit stereo + PCM sound channels (full duplex), plus a third PCM channel intended for use + in hardware-assisted FM synthesis. + + + The current Linux kernel audio driver for this family of chips + supports audio playback and recording, but hardware-assisted + FM features, and hardware buffer direct-access (mmap) + support are not yet available. + + + This driver supports any Linux kernel version after 2.4.10. + + + Please send bug reports to the mailing list linux-via@gtf.org. 
+ To subscribe, e-mail majordomo@gtf.org with + + + subscribe linux-via + + + in the body of the message. + + + + + Driver Installation + + To use this audio driver, select the + CONFIG_SOUND_VIA82CXXX option in the section Sound during kernel configuration. + Follow the usual kernel procedures for rebuilding the kernel, + or building and installing driver modules. + + + To make this driver the default audio driver, you can add the + following to your /etc/conf.modules file: + + + alias sound via82cxxx_audio + + + Note that soundcore and ac97_codec support modules + are also required for working audio, in addition to + the via82cxxx_audio module itself. + + + + + Submitting a bug report + Description of problem + + Describe the application you were using to play/record sound, and how + to reproduce the problem. + + + Diagnostic output + + Obtain the via-audio-diag diagnostics program from + http://sf.net/projects/gkernel/ and provide a dump of the + audio chip's registers while the problem is occurring. Sample command line: + + + ./via-audio-diag -aps > diag-output.txt + + + Driver debug output + + Define VIA_DEBUG at the beginning of the driver, then capture and email + the kernel log output. This can be viewed in the system kernel log (if + enabled), or via the dmesg program. Sample command line: + + + dmesg > /tmp/dmesg-output.txt + + + Bigger kernel message buffer + + If you wish to increase the size of the buffer displayed by dmesg, then + change the LOG_BUF_LEN macro at the top of linux/kernel/printk.c, recompile + your kernel, and pass the LOG_BUF_LEN value to dmesg. Sample command line with + LOG_BUF_LEN == 32768: + + + dmesg -s 32768 > /tmp/dmesg-output.txt + + + + + + Known Bugs And Assumptions + + + Low volume + + + Volume too low on many systems. Workaround: use mixer program + such as xmixer to increase volume. + + + + + + + + + + Thanks + + Via for providing e-mail support, specs, and NDA'd source code. + + + MandrakeSoft for providing hacking time. 
+ + + AC97 mixer interface fixes and debugging by Ron Cemer roncemer@gte.net. + + + Rui Sousa rui.sousa@conexant.com, for bugfixing + MMAP support, and several other notable fixes that resulted from + his hard work and testing. + + + Adrian Cox adrian@humboldt.co.uk, for bugfixing + MMAP support, and several other notable fixes that resulted from + his hard work and testing. + + + Thomas Sailer for further bugfixes. + + + + + Random Notes + + Two /proc pseudo-files provide diagnostic information. This is generally + not useful to most users. Power users can disable CONFIG_SOUND_VIA82CXXX_PROCFS, + and remove the /proc support code. Once + version 2.0.0 is released, the /proc support code will be disabled by + default. Available /proc pseudo-files: + + + /proc/driver/via/0/info + /proc/driver/via/0/ac97 + + + This driver by default supports all PCI audio devices which report + a vendor id of 0x1106, and a device id of 0x3058. Subsystem vendor + and device ids are not examined. + + + GNU indent formatting options: + +-kr -i8 -ts8 -br -ce -bap -sob -l80 -pcs -cs -ss -bs -di1 -nbc -lp -psl + + + + Via has graciously donated e-mail support and source code to help further + the development of this driver. Their assistance has been invaluable + in the design and coding of the next major version of this driver. + + + The Via audio chip apparently provides a second PCM scatter-gather + DMA channel just for FM data, but does not have a full hardware MIDI + processor. I haven't put much thought towards a solution here, but it + might involve using SoftOSS midi wave table, or simply disabling MIDI + support altogether and using the FM PCM channel as a second (input? output?) + + + + + Driver ChangeLog + + +Version 1.9.1 + + + + + DSP read/write bugfixes from Thomas Sailer. + + + + + + Add new PCI id for single-channel use of Via 8233. + + + + + + Other bug fixes, tweaks, new ioctls. 
+ + + + + + + +Version 1.1.15 + + + + + Support for variable fragment size and variable fragment number (Rui + Sousa) + + + + + + Fixes for the SPEED, STEREO, CHANNELS, FMT ioctls when in read & + write mode (Rui Sousa) + + + + + + Mmaped sound is now fully functional. (Rui Sousa) + + + + + + Make sure to enable PCI device before reading any of its PCI + config information. (fixes potential hotplug problems) + + + + + + Clean up code a bit and add more internal function documentation. + + + + + + AC97 codec access fixes (Adrian Cox) + + + + + + Big endian fixes (Adrian Cox) + + + + + + MIDI support (Adrian Cox) + + + + + + Detect and report locked-rate AC97 codecs. If your hardware only + supports 48Khz (locked rate), then your recording/playback software + must upsample or downsample accordingly. The hardware cannot do it. + + + + + + Use new pci_request_regions and pci_disable_device functions in + kernel 2.4.6. + + + + + + + +Version 1.1.14 + + + + + Use VM_RESERVE when available, to eliminate unnecessary page faults. + + + + + + +Version 1.1.12 + + + + + mmap bug fixes from Linus. + + + + + + +Version 1.1.11 + + + + + Many more bug fixes. mmap enabled by default, but may still be buggy. + + + + + + Uses new and spiffy method of mmap'ing the DMA buffer, based + on a suggestion from Linus. + + + + + + +Version 1.1.10 + + + + + Many bug fixes. mmap enabled by default, but may still be buggy. + + + + + + +Version 1.1.9 + + + + + Redesign and rewrite audio playback implementation. (faster and smaller, hopefully) + + + + + + Implement recording and full duplex (DSP_CAP_DUPLEX) support. + + + + + + Make procfs support optional. + + + + + + Quick interrupt status check, to lessen overhead in interrupt + sharing situations. + + + + + + Add mmap(2) support. Disabled for now, it is still buggy and experimental. + + + + + + Surround all syscalls with a semaphore for cheap and easy SMP protection. + + + + + + Fix bug in channel shutdown (hardware channel reset) code. 
+ + + + + + Remove unnecessary spinlocks (better performance). + + + + + + Eliminate "unknown AFMT" message by using a different method + of selecting the best AFMT_xxx sound sample format for use. + + + + + + Support for realtime hardware pointer position reporting + (DSP_CAP_REALTIME, SNDCTL_DSP_GETxPTR ioctls) + + + + + + Support for capture/playback triggering + (DSP_CAP_TRIGGER, SNDCTL_DSP_SETTRIGGER ioctls) + + + + + + SNDCTL_DSP_SETDUPLEX and SNDCTL_DSP_POST ioctls now handled. + + + + + + Rewrite open(2) and close(2) logic to allow only one user at + a time. All other open(2) attempts will sleep until they succeed. + FIXME: open(O_RDONLY) and open(O_WRONLY) should be allowed to succeed. + + + + + + Reviewed code to ensure that SMP and multiple audio devices + are fully supported. + + + + + + + +Version 1.1.8 + + + + + Clean up interrupt handler output. Fixes the following kernel error message: + + + unhandled interrupt ... + + + + + + Convert documentation to DocBook, so that PDF, HTML and PostScript (.ps) output is readily + available. + + + + + + + +Version 1.1.7 + + + + + Fix module unload bug where mixer device left registered + after driver exit + + + + + + +Version 1.1.6 + + + + + Rewrite via_set_rate to mimic ALSA basic AC97 rate setting + + + + + Remove much dead code + + + + + Complete spin_lock_irqsave -> spin_lock_irq conversion in via_dsp_ioctl + + + + + Fix build problem in via_dsp_ioctl + + + + + Optimize included headers to eliminate headers found in linux/sound + + + + + + +Version 1.1.5 + + + + + Disable some overly-verbose debugging code + + + + + Remove unnecessary sound locks + + + + + Fix some ioctls for better time resolution + + + + + Begin spin_lock_irqsave -> spin_lock_irq conversion in via_dsp_ioctl + + + + + + +Version 1.1.4 + + + + + Completed rewrite of driver. Eliminated SoundBlaster compatibility + completely, and now uses the much-faster scatter-gather DMA engine. 
+ + + + + + + + + Internal Functions +!Isound/oss/via82cxxx_audio.c + + + + + diff --git a/Documentation/DocBook/videobook.tmpl b/Documentation/DocBook/videobook.tmpl new file mode 100644 index 000000000000..3ec6c875588a --- /dev/null +++ b/Documentation/DocBook/videobook.tmpl @@ -0,0 +1,1663 @@ + + + + + + Video4Linux Programming + + + + Alan + Cox + +
+ alan@redhat.com +
+
+
+
+ + + 2000 + Alan Cox + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + +
+ + + + + Introduction + + Parts of this document first appeared in Linux Magazine under a + ninety day exclusivity. + + + Video4Linux is intended to provide a common programming interface + for the many TV and capture cards now on the market, as well as + parallel port and USB video cameras. Radio, teletext decoders and + vertical blanking data interfaces are also provided. + + + + Radio Devices + + There are a wide variety of radio interfaces available for PC's, and these + are generally very simple to program. The biggest problem with supporting + such devices is normally extracting documentation from the vendor. + + + The radio interface supports a simple set of control ioctls standardised + across all radio and tv interfaces. It does not support read or write, which + are used for video streams. The reason radio cards do not allow you to read + the audio stream into an application is that without exception they provide + a connection on to a soundcard. Soundcards can be used to read the radio + data just fine. + + + Registering Radio Devices + + The Video4linux core provides an interface for registering devices. The + first step in writing our radio card driver is to register it. + + + + +static struct video_device my_radio = +{ + "My radio", + VID_TYPE_TUNER, + VID_HARDWARE_MYRADIO, + radio_open, + radio_close, + NULL, /* no read */ + NULL, /* no write */ + NULL, /* no poll */ + radio_ioctl, + NULL, /* no special init function */ + NULL /* no private data */ +}; + + + + + This declares our video4linux device driver interface. The VID_TYPE_ value + defines what kind of an interface we are, and defines basic capabilities. + + + The only defined value relevant for a radio card is VID_TYPE_TUNER which + indicates that the device can be tuned. Clearly our radio is going to have some + way to change channel so it is tuneable. + + + The VID_HARDWARE_ types are unique to each device.
Numbers are assigned by + alan@redhat.com when device drivers are going to be released. Until then you + can pull a suitably large number out of your hat and use it. 10000 should be + safe for a very long time even allowing for the huge number of vendors + making new and different radio cards at the moment. + + + We declare an open and close routine, but we do not need read or write, + which are used to read and write video data to or from the card itself. As + we have no read or write there is no poll function. + + + The private initialise function is run when the device is registered. In + this driver we've already done all the work needed. The final pointer is a + private data pointer that can be used by the device driver to attach and + retrieve private data structures. We set this field "priv" to NULL for + the moment. + + + Having the structure defined is all very well but we now need to register it + with the kernel. + + + + +static int io = 0x320; + +int __init myradio_init(struct video_init *v) +{ + if(!request_region(io, MY_IO_SIZE, "myradio")) + { + printk(KERN_ERR + "myradio: port 0x%03X is in use.\n", io); + return -EBUSY; + } + + if(video_register_device(&my_radio, VFL_TYPE_RADIO)==-1) { + release_region(io, MY_IO_SIZE); + return -EINVAL; + } + return 0; +} + + + + The first stage of the initialisation, as is normally the case, is to check + that the I/O space we are about to fiddle with doesn't belong to some other + driver. If it does we leave well alone. If the user gives the address of the + wrong device then we will spot this. These policies will generally avoid + crashing the machine. + + + Now we ask the Video4Linux layer to register the device for us. We hand it + our carefully designed video_device structure and also tell it which group + of devices we want it registered with. In this case VFL_TYPE_RADIO. + + + The types available are + + Device Types + + + + VFL_TYPE_RADIO (/dev/radio{n}): + + Radio devices are assigned in this block. 
As with all of these + selections the actual number assignment is done by the video layer + according to what is free. + + VFL_TYPE_GRABBER (/dev/video{n}): + Video capture devices and also -- counter-intuitively for the name -- + hardware video playback devices such as MPEG2 cards. + + VFL_TYPE_VBI (/dev/vbi{n}): + The VBI devices capture the hidden lines on a television picture + that carry further information like closed caption data, teletext + (primarily in Europe) and now Intercast and the ATVEC internet + television encodings. + + VFL_TYPE_VTX (/dev/vtx{n}): + VTX is 'Videotext' also known as 'Teletext'. This is a system for + sending numbered, 40x25, mostly textual page images over the hidden + lines. Unlike the /dev/vbi interfaces, this is for 'smart' decoder + chips. (The use of the word smart here has to be taken in context, + the smartest teletext chips are fairly dumb pieces of technology). + + + + +
+ + We are most definitely a radio. + + + With our I/O space already claimed so that nobody treads on us, we return 0 + to signify general happiness with the state of the universe. + +
+ + Opening And Closing The Radio + + + The functions we declared in our video_device are mostly very simple. + Firstly we can drop in what is basically standard code for open and close. + + + + +static int users = 0; + +static int radio_open(struct video_device *dev, int flags) +{ + if(users) + return -EBUSY; + users++; + return 0; +} + + + + At open time we need to do nothing but check if someone else is also using + the radio card. If nobody is using it we make a note that we are using it, + then we ensure that nobody unloads our driver on us. + + + + +static int radio_close(struct video_device *dev) +{ + users--; + return 0; +} + + + + At close time we simply need to reduce the user count and allow the module + to become unloadable. + + + If you are sharp you will have noticed neither the open nor the close + routines attempt to reset or change the radio settings. This is intentional. + It allows an application to set up the radio and exit. It avoids a user + having to leave an application running all the time just to listen to the + radio. + + + + The Ioctl Interface + + This leaves the ioctl routine, without which the driver will not be + terribly useful to anyone. + + + + +static int radio_ioctl(struct video_device *dev, unsigned int cmd, void *arg) +{ + switch(cmd) + { + case VIDIOCGCAP: + { + struct video_capability v; + v.type = VID_TYPE_TUNER; + v.channels = 1; + v.audios = 1; + v.maxwidth = 0; + v.minwidth = 0; + v.maxheight = 0; + v.minheight = 0; + strcpy(v.name, "My Radio"); + if(copy_to_user(arg, &v, sizeof(v))) + return -EFAULT; + return 0; + } + + + + VIDIOCGCAP is the first ioctl all video4linux devices must support. It + allows the applications to find out what sort of a card they have found and + to figure out what they want to do about it. The fields in the structure are + + struct video_capability fields + + + + name: The device text name. This is intended for the user. + + channels: The number of different channels you can tune on + this card. 
It could even be zero for a card that has + no tuning capability. For our simple FM radio it is 1. + An AM/FM radio would report 2. + + audios: The number of audio inputs on this device. For our + radio there is only one audio input. + + minwidth, minheight: The smallest size the card is capable of capturing + images in. We set these to zero. Radios do not + capture pictures. + + maxwidth, maxheight: The largest image size the card is capable of + capturing. For our radio we report 0. + + + type: This reports the capabilities of the device, and + matches the field we filled in in the struct + video_device when registering. + + + +
+ + Having filled in the fields, we use copy_to_user to copy the structure into + the user's buffer. If the copy fails we return an EFAULT to the application + so that it knows it tried to feed us garbage. + + + The next pair of ioctl operations select which tuner is to be used and let + the application find the tuner properties. We have only a single FM band + tuner in our example device. + + + + + case VIDIOCGTUNER: + { + struct video_tuner v; + if(copy_from_user(&v, arg, sizeof(v))!=0) + return -EFAULT; + if(v.tuner) + return -EINVAL; + v.rangelow=(87*16000); + v.rangehigh=(108*16000); + v.flags = VIDEO_TUNER_LOW; + v.mode = VIDEO_MODE_AUTO; + v.signal = 0xFFFF; + strcpy(v.name, "FM"); + if(copy_to_user(arg, &v, sizeof(v))!=0) + return -EFAULT; + return 0; + } + + + + The VIDIOCGTUNER ioctl allows applications to query a tuner. The application + sets the tuner field to the tuner number it wishes to query. The query does + not change the tuner that is being used, it merely enquires about the tuner + in question. + + + We have exactly one tuner so after copying the user buffer to our temporary + structure we complain if they asked for a tuner other than tuner 0. + + + The video_tuner structure has the following fields + + struct video_tuner fields + + + + int tuner: The number of the tuner in question + + char name[32]: A text description of this tuner. "FM" will do fine. + This is intended for the application. + + u32 flags: + Tuner capability flags + + + u16 mode: The current reception mode + + + u16 signal: The signal strength scaled between 0 and 65535. If + a device cannot tell the signal strength it should + report 65535. Many simple cards contain only a + signal/no signal bit. Such cards will report either + 0 or 65535. + + + u32 rangelow, rangehigh: + The range of frequencies supported by the radio + or TV. It is scaled according to the VIDEO_TUNER_LOW + flag. + + + + +
+ + struct video_tuner flags + + + + VIDEO_TUNER_PAL: A PAL TV tuner + + VIDEO_TUNER_NTSC: An NTSC (US) TV tuner + + VIDEO_TUNER_SECAM: A SECAM (French) TV tuner + + VIDEO_TUNER_LOW: + The tuner frequency is scaled in 1/16th of a KHz + steps. If not it is in 1/16th of a MHz steps + + + VIDEO_TUNER_NORM: The tuner can set its format + + VIDEO_TUNER_STEREO_ON: The tuner is currently receiving a stereo signal + + + +
+ + struct video_tuner modes + + + + VIDEO_MODE_PAL: PAL Format + + VIDEO_MODE_NTSC: NTSC Format (USA) + + VIDEO_MODE_SECAM: French Format + + VIDEO_MODE_AUTO: A device that does not need to do + TV format switching + + + +
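The scaling selected by VIDEO_TUNER_LOW is easy to get wrong by a factor of 1000, so a small worked example may help. The following is a user-space sketch, not driver code. The helper names khz_to_tuner_units and tuner_units_to_khz are hypothetical and are not part of the Video4Linux API. It assumes only the 1/16 KHz and 1/16 MHz step sizes given in the flags table above.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helpers for illustration only; they are not part of
 * the Video4Linux API.
 *
 * low != 0: VIDEO_TUNER_LOW is set, one unit is 1/16 KHz.
 * low == 0: the flag is clear, one unit is 1/16 MHz.
 */
unsigned long khz_to_tuner_units(unsigned long khz, int low)
{
	assert(khz <= ULONG_MAX / 16);	/* guard the multiplication */
	if (low)
		return khz * 16;	/* 16 units per KHz */
	return (khz * 16) / 1000;	/* 16 units per MHz, truncating */
}

unsigned long tuner_units_to_khz(unsigned long units, int low)
{
	if (low)
		return units / 16;
	assert(units <= ULONG_MAX / 1000);
	return (units * 1000) / 16;
}
```

With the flag set, 87 MHz is 87000 KHz, which gives 87000 * 16 = 1392000 units. That is the same number as the 87*16000 used for rangelow in the VIDIOCGTUNER handler.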
+ + The settings for the radio card are thus fairly simple. We report that we + are a tuner called "FM" for FM radio. In order to get the best tuning + resolution we report VIDEO_TUNER_LOW and select tuning to 1/16th of a KHz. It's + unlikely our card can do that resolution but it is a fair bet the card can + do better than 1/16th of a MHz. VIDEO_TUNER_LOW is appropriate to almost all + radio usage. + + + We report that the tuner automatically handles deciding what format it is + receiving - true enough as it only handles FM radio. Our example card is + also incapable of detecting stereo or signal strengths so it reports a + strength of 0xFFFF (maximum) and no stereo detected. + + + To finish off we set the range that can be tuned to be 87-108MHz, the normal + FM broadcast radio range. It is important to find out what the card is + actually capable of tuning. It is easy enough to simply use the FM broadcast + range. Unfortunately if you do this you will discover the FM broadcast + ranges in the USA, Europe and Japan are all subtly different and some users + cannot receive all the stations they wish. + + + The application also needs to be able to set the tuner it wishes to use. In + our case, with a single tuner this is rather simple to arrange. + + + + case VIDIOCSTUNER: + { + struct video_tuner v; + if(copy_from_user(&v, arg, sizeof(v))) + return -EFAULT; + if(v.tuner != 0) + return -EINVAL; + return 0; + } + + + + We copy the user supplied structure into kernel memory so we can examine it. + If the user has selected a tuner other than zero we reject the request. If + they wanted tuner 0 then, surprisingly enough, that is the current tuner already. + + + The next two ioctls we need to provide are to get and set the frequency of + the radio. These both use an unsigned long argument which is the frequency. + The scale of the frequency depends on the VIDEO_TUNER_LOW flag as I + mentioned earlier on. Since we have VIDEO_TUNER_LOW set this will be in + 1/16ths of a KHz. 
+ + +static unsigned long current_freq; + + + + case VIDIOCGFREQ: + if(copy_to_user(arg, &current_freq, + sizeof(unsigned long))) + return -EFAULT; + return 0; + + + + Querying the frequency in our case is relatively simple. Our radio card is + too dumb to let us query the frequency it is tuned to so we remember our setting if + we know it. All we have to do is copy it to the user. + + + + + case VIDIOCSFREQ: + { + unsigned long freq; + if(copy_from_user(&freq, arg, + sizeof(unsigned long))!=0) + return -EFAULT; + if(hardware_set_freq(freq)<0) + return -EINVAL; + current_freq = freq; + return 0; + } + + + + Setting the frequency is a little more complex. We begin by copying the + desired frequency into kernel space. Next we call a hardware specific routine + to set the radio up. This might be as simple as some scaling and a few + writes to an I/O port. For most radio cards it turns out a good deal more + complicated and may involve programming things like a phase locked loop on + the card. This is what documentation is for. + + + The final set of operations we need to provide for our radio are the + volume controls. Not all radio cards can even do volume control. After all + there is a perfectly good volume control on the sound card. We will assume + our radio card has a simple 4 step volume control. + + + There are two ioctls with audio we need to support + + + +static int current_volume=0; + + case VIDIOCGAUDIO: + { + struct video_audio v; + if(copy_from_user(&v, arg, sizeof(v))) + return -EFAULT; + if(v.audio != 0) + return -EINVAL; + v.volume = 16384*current_volume; + v.step = 16384; + strcpy(v.name, "Radio"); + v.mode = VIDEO_SOUND_MONO; + v.balance = 0; + v.bass = 0; + v.treble = 0; + + if(copy_to_user(arg, &v, sizeof(v))) + return -EFAULT; + return 0; + } + + + + Much like the tuner we start by copying the user structure into kernel + space. Again we check if the user has asked for a valid audio input. We have + only input 0 and we punt if they ask for another input. 
+ + + Then we fill in the video_audio structure. This has the following format + + struct video_audio fields + + + + audio: The input the user wishes to query + + volume: The volume setting on a scale of 0-65535 + + bass: The bass level on a scale of 0-65535 + + treble: The treble level on a scale of 0-65535 + + flags: The features this audio device supports + + + name: A text name to display to the user. We picked + "Radio" as it explains things quite nicely. + + mode: The current reception mode for the audio + + We report MONO because our card is too stupid to know if it is in + mono or stereo. + + + balance: The stereo balance on a scale of 0-65535, 32768 is + middle. + + step: The step by which the volume control jumps. This is + used to help make it easy for applications to set + slider behaviour. + + + +
+ + struct video_audio flags + + + + VIDEO_AUDIO_MUTE: The audio is currently muted. We + could fake this in our driver but we + choose not to bother. + + VIDEO_AUDIO_MUTABLE: The input has a mute option + + VIDEO_AUDIO_TREBLE: The input has a treble control + + VIDEO_AUDIO_BASS: The input has a bass control + + + +
+ + struct video_audio modes + + + + VIDEO_SOUND_MONO: Mono sound + + VIDEO_SOUND_STEREO: Stereo sound + + VIDEO_SOUND_LANG1: Alternative language 1 (TV specific) + + VIDEO_SOUND_LANG2: Alternative language 2 (TV specific) + + + +
+ + Having filled in the structure we copy it back to user space. + + + The VIDIOCSAUDIO ioctl allows the user to set the audio parameters in the + video_audio structure. The driver does its best to honour the request. + + + + case VIDIOCSAUDIO: + { + struct video_audio v; + if(copy_from_user(&v, arg, sizeof(v))) + return -EFAULT; + if(v.audio) + return -EINVAL; + current_volume = v.volume/16384; + hardware_set_volume(current_volume); + return 0; + } + + + + In our case there is very little that the user can set. The volume is + basically the limit. Note that we could pretend to have a mute feature + by rewriting this to + + + + case VIDIOCSAUDIO: + { + struct video_audio v; + if(copy_from_user(&v, arg, sizeof(v))) + return -EFAULT; + if(v.audio) + return -EINVAL; + current_volume = v.volume/16384; + if(v.flags&VIDEO_AUDIO_MUTE) + hardware_set_volume(0); + else + hardware_set_volume(current_volume); + current_muted = v.flags & + VIDEO_AUDIO_MUTE; + return 0; + } + + + + This, together with the corresponding changes to the VIDIOCGAUDIO code to report the + state of the mute flag we save and to report that the card has a mute function, + will allow applications to use a mute facility with this card. It is + questionable whether this is a good idea however. User applications can already + fake this themselves and kernel space is precious. + + + We now have a working radio ioctl handler. So we just wrap up the function + + + + + } + return -ENOIOCTLCMD; +} + + + + and pass the Video4Linux layer back an error so that it knows we did not + understand the request we got passed. + +
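The volume arithmetic in the VIDIOCGAUDIO and VIDIOCSAUDIO handlers can be tried out in user space. This is a minimal sketch under the assumption made in the text of a 4 step volume control reported with a step of 16384. The names volume_to_level, level_to_volume, VOLUME_STEP and VOLUME_LEVELS are hypothetical and exist only for this illustration.

```c
#include <assert.h>

#define VOLUME_STEP	16384	/* 65536 / 4, the value reported in v.step */
#define VOLUME_LEVELS	4	/* hypothetical card with levels 0..3 */

/* Map a 0-65535 request onto a hardware level, rounding down. */
int volume_to_level(unsigned int volume)
{
	assert(volume <= 65535);
	return (int)(volume / VOLUME_STEP);
}

/* Map a hardware level back onto the 0-65535 scale. */
unsigned int level_to_volume(int level)
{
	assert(level >= 0 && level < VOLUME_LEVELS);
	return (unsigned int)level * VOLUME_STEP;
}
```

One consequence of rounding down is that the loudest hardware level reads back as 49152 rather than 65535. An application that uses the reported step to move its slider copes with this without trouble.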
+ + Module Wrapper + + Finally we add in the usual module wrapping and the driver is done. + + + +#ifndef MODULE + +static int io = 0x300; + +#else + +static int io = -1; + +#endif + +MODULE_AUTHOR("Alan Cox"); +MODULE_DESCRIPTION("A driver for an imaginary radio card."); +module_param(io, int, 0444); +MODULE_PARM_DESC(io, "I/O address of the card."); + +static int __init init(void) +{ + if(io==-1) + { + printk(KERN_ERR + "You must set an I/O address with io=0x???\n"); + return -EINVAL; + } + return myradio_init(NULL); +} + +static void __exit cleanup(void) +{ + video_unregister_device(&my_radio); + release_region(io, MY_IO_SIZE); +} + +module_init(init); +module_exit(cleanup); + + + + In this example we set the IO base by default if the driver is compiled into + the kernel: you can still set it using "my_radio.io" if this file is called my_radio.c. For the module we require that the + user sets the parameter. We set io to a nonsense port (-1) so that we can + tell if the user supplied an io parameter or not. + + + We use MODULE_ defines to give an author for the card driver and a + description. We also use them to declare that io is an integer and it is the + address of the card, and can be read by anyone from sysfs. + + + The clean-up routine unregisters the video_device we registered, and frees + up the I/O space. Note that the unregister takes the actual video_device + structure as its argument. Unlike the file operations structure which can be + shared by all instances of a device a video_device structure is an actual + instance of the device. If you are registering multiple radio devices you + need to fill in one structure per device (most likely by setting up a + template and copying it to each of the actual device structures). + + +
+ + Video Capture Devices + + Video Capture Device Types + + The video capture devices share the same interfaces as radio devices. In + order to explain the video capture interface I will use the example of a + camera that has no tuners or audio input. This keeps the example relatively + clean. To get both, combine the two driver examples. + + + Video capture devices divide into four categories. First, a little technology + background. Full motion video even at television resolution (which is + actually fairly low) is pretty resource-intensive. You are continually + passing megabytes of data every second from the capture card to the display. + Several alternative approaches have emerged because copying this through the + processor and the user program is a particularly bad idea. + + + The first is to add the television image onto the video output directly. + This is also how some 3D cards work. These basic cards can generally drop the + video into any chosen rectangle of the display. Cards like this, which + include most mpeg1 cards that used the feature connector, aren't very + friendly in a windowing environment. They don't understand windows or + clipping. The video window is always on the top of the display. + + + Chroma keying is a technique used by cards to get around this. It is an old + television mixing trick where you mark all the areas you wish to replace + with a single clear colour that isn't used in the image - TV people use an + incredibly bright blue while computing people often use a particularly + virulent purple. Bright blue occurs on the desktop. Anyone with virulent + purple windows has another problem besides their TV overlay. + + + The third approach is to copy the data from the capture card to the video + card, but to do it directly across the PCI bus. This relieves the processor + from doing the work but does require some smartness on the part of the video + capture chip, as well as a suitable video card. 
Programming this kind of + card and more so debugging it can be extremely tricky. There are some quite + complicated interactions with the display and you may also have to cope with + various chipset bugs that show up when PCI cards start talking to each + other. + + + To keep our example fairly simple we will assume a card that supports + overlaying a flat rectangular image onto the frame buffer output, and which + can also capture stuff into processor memory. + + + + Registering Video Capture Devices + + This time we need to add more functions for our camera device. + + +static struct video_device my_camera = +{ + "My Camera", + VID_TYPE_OVERLAY|VID_TYPE_SCALES|\ + VID_TYPE_CAPTURE|VID_TYPE_CHROMAKEY, + VID_HARDWARE_MYCAMERA, + camera_open, + camera_close, + camera_read, /* read a captured frame */ + NULL, /* no write */ + camera_poll, /* wait for the next frame */ + camera_ioctl, + NULL, /* no special init function */ + NULL /* no private data */ +}; + + + We need a read() function which is used for capturing data from + the card, and we need a poll function so that an application can wait for the next + frame to be captured. + + + We use the extra video capability flags that did not apply to the + radio interface. The video related flags are + + Capture Capabilities + + + +VID_TYPE_CAPTURE: We support image capture + +VID_TYPE_TELETEXT: A teletext capture device (vbi{n}) + +VID_TYPE_OVERLAY: The image can be directly overlaid onto the + frame buffer + +VID_TYPE_CHROMAKEY: Chromakey can be used to select which parts + of the image to display + +VID_TYPE_CLIPPING: It is possible to give the board a list of + rectangles to draw around. + +VID_TYPE_FRAMERAM: The video capture goes into the video memory + and actually changes it. Applications need + to know this so they can clean up after the + card + +VID_TYPE_SCALES: The image can be scaled to various sizes, + rather than being a single fixed size. + +VID_TYPE_MONOCHROME: The capture will be monochrome. 
This isn't a + complete answer to the question since a mono + camera on a colour capture card will still + produce mono output. + +VID_TYPE_SUBCAPTURE: The card allows only part of its field of + view to be captured. This enables + applications to avoid copying all of a large + image into memory when only some section is + relevant. + + + +
+ + We set VID_TYPE_CAPTURE so that we are seen as a capture card, + VID_TYPE_CHROMAKEY so the application knows it is time to draw in virulent + purple, and VID_TYPE_SCALES because we can be resized. + + + Our setup is fairly similar. This time we also want an interrupt line + for the 'frame captured' signal. Not all cards have this so some of them + cannot handle poll(). + + + + +static int io = 0x320; +static int irq = 11; + +int __init mycamera_init(struct video_init *v) +{ + if(!request_region(io, MY_IO_SIZE, "mycamera")) + { + printk(KERN_ERR + "mycamera: port 0x%03X is in use.\n", io); + return -EBUSY; + } + + if(video_register_device(&my_camera, + VFL_TYPE_GRABBER)==-1) { + release_region(io, MY_IO_SIZE); + return -EINVAL; + } + return 0; +} + + + + This is little changed from the needs of the radio card. We specify + VFL_TYPE_GRABBER this time as we want to be allocated a /dev/video name. + +
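Seen from the application side, the capability word is simply a bit mask to be tested. The sketch below shows one way a program might check that every capability it needs is present. The EX_ constants and the has_caps helper are hypothetical stand-ins invented for this illustration. Their numeric values are placeholders, and a real program should take the VID_TYPE_ definitions from the kernel's videodev header instead.

```c
#include <assert.h>

/* Placeholder bit values for illustration; use the kernel header in real code. */
#define EX_VID_TYPE_CAPTURE	0x01
#define EX_VID_TYPE_TUNER	0x02
#define EX_VID_TYPE_OVERLAY	0x08
#define EX_VID_TYPE_CHROMAKEY	0x10
#define EX_VID_TYPE_CLIPPING	0x20
#define EX_VID_TYPE_SCALES	0x80

/* The capability set our example camera advertises. */
#define EX_CAMERA_CAPS	(EX_VID_TYPE_OVERLAY | EX_VID_TYPE_SCALES | \
			 EX_VID_TYPE_CAPTURE | EX_VID_TYPE_CHROMAKEY)

/* Return 1 if every bit in 'wanted' is set in 'type', otherwise 0. */
int has_caps(unsigned int type, unsigned int wanted)
{
	assert(wanted != 0);	/* asking for nothing is a caller bug */
	return (type & wanted) == wanted;
}
```

Comparing against the whole mask, rather than testing for a non-zero result, matters when more than one bit is requested. A card with overlay but no chromakey would otherwise pass a check for both.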
+ + Opening And Closing The Capture Device + + + +static int users = 0; + +static int camera_open(struct video_device *dev, int flags) +{ + if(users) + return -EBUSY; + if(request_irq(irq, camera_irq, 0, "camera", dev)<0) + return -EBUSY; + users++; + return 0; +} + + +static int camera_close(struct video_device *dev) +{ + users--; + free_irq(irq, dev); + return 0; +} + + + The open and close routines are also quite similar. The only real change is + that we now request an interrupt for the camera device interrupt line. If we + cannot get the interrupt we report EBUSY to the application and give up. + + + + Interrupt Handling + + Our example handler is for an ISA bus device. If it were PCI you would be + able to share the interrupt and would have set SA_SHIRQ to indicate a + shared IRQ. We pass the device pointer as the interrupt routine argument. We + don't need to since we only support one card but doing this will make it + easier to upgrade the driver for multiple devices in the future. + + + Our interrupt routine needs to do little if we assume the card can simply + queue one frame to be read after it captures it. + + + + +static struct wait_queue *capture_wait; +static int capture_ready = 0; + +static void camera_irq(int irq, void *dev_id, + struct pt_regs *regs) +{ + capture_ready=1; + wake_up_interruptible(&capture_wait); +} + + + The interrupt handler is nice and simple for this card as we are assuming + the card is buffering the frame for us. This means we have little to do but + wake up anybody interested. We also set a capture_ready flag, as we may + capture a frame before an application needs it. In this case we need to know + that a frame is ready. If we had to collect the frame on the interrupt life + would be more complex. + + + The two new routines we need to supply are camera_read which returns a + frame, and camera_poll which waits for a frame to become ready. 
+ + + +static int camera_poll(struct video_device *dev, + struct file *file, struct poll_table *wait) +{ + poll_wait(file, &capture_wait, wait); + if(capture_ready) + return POLLIN|POLLRDNORM; + return 0; +} + + + + Our wait queue for polling is the capture_wait queue. This will cause the + task to be woken up by our camera_irq routine. We check capture_ready to see + if there is an image present and if so report that it is readable. + + + + Reading The Video Image + + + +static long camera_read(struct video_device *dev, char *buf, + unsigned long count) +{ + struct wait_queue wait = { current, NULL }; + u8 *ptr; + int len; + int i; + + add_wait_queue(&capture_wait, &wait); + + while(!capture_ready) + { + if(file->flags&O_NDELAY) + { + remove_wait_queue(&capture_wait, &wait); + current->state = TASK_RUNNING; + return -EWOULDBLOCK; + } + if(signal_pending(current)) + { + remove_wait_queue(&capture_wait, &wait); + current->state = TASK_RUNNING; + return -ERESTARTSYS; + } + current->state = TASK_INTERRUPTIBLE; + schedule(); + } + remove_wait_queue(&capture_wait, &wait); + current->state = TASK_RUNNING; + + + + The first thing we have to do is to ensure that the application waits until + the next frame is ready. The code here is almost identical to the mouse code + we used earlier in this chapter. It is one of the common building blocks of + Linux device driver code and probably one which you will find occurs in any + drivers you write. + + + We wait for a frame to be ready, or for a signal to interrupt our waiting. If a + signal occurs we need to return from the system call so that the signal can + be sent to the application itself. We also check to see if the user actually + wanted to avoid waiting - i.e. if they are using non-blocking I/O and have other things + to get on with. + + + Next we copy the data from the card to the user application. This is rarely + as easy as our example makes out. 
We will add capture_w and capture_h here + to hold the width and height of the captured image. We assume the card only + supports 24bit RGB for now. + + + + + + capture_ready = 0; + + ptr=(u8 *)buf; + len = capture_w * 3 * capture_h; /* 24bit RGB */ + + if(len>count) + len=count; /* Doesn't all fit */ + + for(i=0; i<len; i++) + { + put_user(inb(io+IMAGE_DATA), ptr); + ptr++; + } + + hardware_restart_capture(); + + return i; +} + + + + For a real hardware device you would try to avoid the loop with put_user(). + Each call to put_user() has a time overhead checking whether the accesses to user + space are allowed. It would be better to read a line into a temporary buffer + then copy this to user space in one go. + + + Having captured the image and put it into user space we can kick the card to + get the next frame acquired. + + + + Video Ioctl Handling + + As with the radio driver the major control interface is via the ioctl() + function. Video capture devices support the same tuner calls as a radio + device and also support additional calls to control how the video functions + are handled. In this simple example the card has no tuners to avoid making + the code complex. + + + + + +static int camera_ioctl(struct video_device *dev, unsigned int cmd, void *arg) +{ + switch(cmd) + { + case VIDIOCGCAP: + { + struct video_capability v; + v.type = VID_TYPE_CAPTURE|\ + VID_TYPE_CHROMAKEY|\ + VID_TYPE_SCALES|\ + VID_TYPE_OVERLAY; + v.channels = 1; + v.audios = 0; + v.maxwidth = 640; + v.minwidth = 16; + v.maxheight = 480; + v.minheight = 16; + strcpy(v.name, "My Camera"); + if(copy_to_user(arg, &v, sizeof(v))) + return -EFAULT; + return 0; + } + + + + + The first ioctl we must support and which all video capture and radio + devices are required to support is VIDIOCGCAP. This behaves exactly the same + as with a radio device. This time, however, we report the extra capabilities + we outlined earlier on when defining our video_device structure. 
+ + + We now set the video flags saying that we support overlay, capture, + scaling and chromakey. We also report size limits - our smallest image is + 16x16 pixels, our largest is 640x480. + + + To keep things simple we report no audio and no tuning capabilities at all. + + + + case VIDIOCGCHAN: + { + struct video_channel v; + if(copy_from_user(&v, arg, sizeof(v))) + return -EFAULT; + if(v.channel != 0) + return -EINVAL; + v.flags = 0; + v.tuners = 0; + v.type = VIDEO_TYPE_CAMERA; + v.norm = VIDEO_MODE_AUTO; + strcpy(v.name, "Camera Input"); + if(copy_to_user(arg, &v, sizeof(v))) + return -EFAULT; + return 0; + } + + + + + This follows what is very much the standard way an ioctl handler looks + in Linux. We copy the data into a kernel space variable and we check that the + request is valid (in this case that the input is 0). Finally we copy the + camera info back to the user. + + + The VIDIOCGCHAN ioctl allows a user to ask about video channels (that is + inputs to the video card). Our example card has a single camera input. The + fields in the structure are + + struct video_channel fields + + + + + channel: The channel number we are selecting + + name: The name for this channel. This is intended + to describe the port to the user. + Appropriate names are therefore things like + "Camera" or "SCART input" + + flags: Channel properties + + type: Input type + + norm: The current television encoding being used + if relevant for this channel. + + + + +
+ struct video_channel flags + + + + VIDEO_VC_TUNER: Channel has a tuner. + + VIDEO_VC_AUDIO: Channel has audio. + + + +
+ struct video_channel types + + + + VIDEO_TYPE_TV: Television input. + + VIDEO_TYPE_CAMERA: Fixed camera input. + + 0: Type is unknown. + + + +
+ struct video_channel norms + + + + VIDEO_MODE_PAL: PAL encoded Television + + VIDEO_MODE_NTSC: NTSC (US) encoded Television + + VIDEO_MODE_SECAM: SECAM (French) Television + + VIDEO_MODE_AUTO: Automatic switching, or format does not + matter + + + +
+ + The corresponding VIDIOCSCHAN ioctl allows a user to change channel and to + request the norm is changed - for example to switch between a PAL or an NTSC + format camera. + + + + + case VIDIOCSCHAN: + { + struct video_channel v; + if(copy_from_user(&v, arg, sizeof(v))) + return -EFAULT; + if(v.channel != 0) + return -EINVAL; + if(v.norm != VIDEO_MODE_AUTO) + return -EINVAL; + return 0; + } + + + + + The implementation of this call in our driver is remarkably easy. Because we + are assuming fixed format hardware we need only check that the user has not + tried to change anything. + + + The user also needs to be able to configure and adjust the picture they are + seeing. This is much like adjusting a television set. A user application + also needs to know the palette being used so that it knows how to display + the image that has been captured. The VIDIOCGPICT and VIDIOCSPICT ioctl + calls provide this information. + + + + + case VIDIOCGPICT: + { + struct video_picture v; + v.brightness = hardware_brightness(); + v.hue = hardware_hue(); + v.colour = hardware_saturation(); + v.contrast = hardware_contrast(); + /* Not settable */ + v.whiteness = 32768; + v.depth = 24; /* 24bit */ + v.palette = VIDEO_PALETTE_RGB24; + if(copy_to_user(arg, &v, + sizeof(v))) + return -EFAULT; + return 0; + } + + + + + The brightness, hue, colour, and contrast provide the picture controls that + are akin to a conventional television. Whiteness provides additional + control for greyscale images. All of these values are scaled between 0-65535 + and have 32768 as the mid point setting. The scaling means that applications + do not have to worry about the capability range of the hardware but can let + it make a best effort attempt. + + + Our depth is 24, as this is in bits. We will be returning RGB24 format. This + has one byte of red, then one of green, then one of blue. This then repeats + for every other pixel in the image. 
The other common formats the interface + defines are + + Framebuffer Encodings + + + + GREY: Linear greyscale. This is for simple cameras and the + like + + RGB565: The top 5 bits hold 32 red levels, the next six bits + hold green and the low 5 bits hold blue. + + RGB555: The top bit is clear. The red green and blue levels + each occupy five bits. + + + +
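The bit layouts in the table above are easier to see in code. The following user-space sketch packs 8 bit per channel RGB24 pixels into the 16 bit formats. The function names are hypothetical and are not part of Video4Linux. For RGB555 the table only says that each colour occupies five bits below a clear top bit, so placing red highest is an assumption made here to mirror RGB565. The sketch also says nothing about the byte order in which a card stores each 16 bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RGB565: top 5 bits red, next 6 bits green, low 5 bits blue. */
uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
	return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* RGB555: top bit clear, then 5 bits each; red-first order is assumed. */
uint16_t pack_rgb555(uint8_t r, uint8_t g, uint8_t b)
{
	return (uint16_t)(((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
}

/* Convert one line of RGB24 (red, green, blue bytes per pixel) to RGB565. */
void rgb24_to_rgb565_line(const uint8_t *src, uint16_t *dst, size_t npixels)
{
	size_t i;

	assert(src != NULL && dst != NULL);
	for (i = 0; i < npixels; i++)
		dst[i] = pack_rgb565(src[3 * i], src[3 * i + 1], src[3 * i + 2]);
}
```

Green gets the extra bit in RGB565, which is why it is shifted right by two rather than three. The low bits of each channel are simply discarded.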
+ + Additional modes are supported for YUV capture formats. These are common for + TV and video conferencing applications. + + + The VIDIOCSPICT ioctl allows a user to set some of the picture parameters. + Exactly which ones are supported depends heavily on the card itself. It is + possible to support many modes and effects in software. In general doing + this in the kernel is a bad idea. Video capture is a performance-sensitive + application and the programs can often do better if they aren't being + 'helped' by an overkeen driver writer. Thus for our device we will report + RGB24 only and refuse to allow a change. + + + + + case VIDIOCSPICT: + { + struct video_picture v; + if(copy_from_user(&v, arg, sizeof(v))) + return -EFAULT; + if(v.depth!=24 || + v.palette != VIDEO_PALETTE_RGB24) + return -EINVAL; + set_hardware_brightness(v.brightness); + set_hardware_hue(v.hue); + set_hardware_saturation(v.colour); + set_hardware_contrast(v.contrast); + return 0; + } + + + + + We check the user has not tried to change the palette or the depth before + we touch the hardware. We do not want to carry out some of the changes and + then return an error. This may confuse the application which will be + assuming no change occurred. + + + In much the same way as you need to be able to set the picture controls to + get the right capture images, many cards need to know what they are + displaying onto when generating overlay output. In some cases getting this + wrong makes a nasty mess or may even crash the computer. For that reason + the VIDIOCSFBUF ioctl used to set up the frame buffer information may well + only be usable by root. + + + We will assume our card is one of the old ISA devices with feature connector + and only supports a couple of standard video modes. This is very common for + older cards, although the PCI devices are much smarter than this.
+ + + + +static struct video_buffer capture_fb; + + case VIDIOCGFBUF: + { + if(copy_to_user(arg, &capture_fb, + sizeof(capture_fb))) + return -EFAULT; + return 0; + + } + + + + + We keep the frame buffer information in the format the ioctl uses. This + makes it nice and easy to work with in the ioctl calls. + + + + case VIDIOCSFBUF: + { + struct video_buffer v; + + if(!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if(copy_from_user(&v, arg, sizeof(v))) + return -EFAULT; + if(v.width!=320 && v.width!=640) + return -EINVAL; + if(v.height!=200 && v.height!=240 + && v.height!=400 + && v.height !=480) + return -EINVAL; + memcpy(&capture_fb, &v, sizeof(v)); + hardware_set_fb(&v); + return 0; + } + + + + + + The capable() function checks a user has the required capability. The Linux + operating system has a set of about 30 capabilities indicating privileged + access to services. The default set up gives the superuser (uid 0) all of + them and nobody else has any. + + + We check that the user has the CAP_SYS_ADMIN capability, that is they are + allowed to operate as the machine administrator. We don't want anyone but + the administrator making a mess of the display. + + + Next we check for standard PC video modes (320 or 640 wide with either + EGA or VGA heights). If the mode is not a standard video mode we reject it as + not supported by our card. If the mode is acceptable we save it so that + VIDIOCGFBUF will give the right answer next time it is called. The + hardware_set_fb() function is some undescribed card specific function to + program the card for the desired mode. + + + Before the driver can display an overlay window it needs to know where the + window should be placed, and also how large it should be. If the card + supports clipping it needs to know which rectangles to omit from the + display. The video_window structure is used to describe the way the image + should be displayed. + + struct video_window fields + + + + width: The width in pixels of the desired image.
The card + may use a smaller size if this size is not available. + + height: The height of the image. The card may use a smaller + size if this size is not available. + + x: The X position of the top left of the window. This + is in pixels relative to the left hand edge of the + picture. Not all cards can display images aligned on + any pixel boundary. If the position is unsuitable + the card adjusts the image right and reduces the + width. + + y: The Y position of the top left of the window. This + is counted in pixels relative to the top edge of the + picture. As with the width if the card cannot + display starting on this line it will adjust the + values. + + chromakey: The colour (expressed in RGB32 format) for the + chromakey colour if chroma keying is being used. + + clips: An array of rectangles that must not be drawn + over. + + clipcount: The number of clips in this array. + + + +
+ + Each clip is a struct video_clip which has the following fields: + + video_clip fields + + + + x, y: Co-ordinates relative to the display + + width, height: Width and height in pixels + + next: A spare field for the application to use + + + +
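To make the role of the clip list concrete, here is a minimal sketch of the test a software renderer could apply before drawing a pixel. It uses a simplified stand-in structure (struct clip_rect, a name invented for this example) that carries only the geometry fields listed above, and it assumes the rectangles are given as top left corner plus size.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct video_clip: geometry fields only. */
struct clip_rect {
	int x, y;          /* top left corner, relative to the display */
	int width, height; /* size in pixels */
};

/* Return true if pixel (px, py) falls inside any clip rectangle,
 * meaning the driver must leave it alone. */
static inline bool pixel_is_clipped(const struct clip_rect *clips,
				    size_t clipcount, int px, int py)
{
	for (size_t i = 0; i < clipcount; i++) {
		const struct clip_rect *c = &clips[i];

		if (px >= c->x && px < c->x + c->width &&
		    py >= c->y && py < c->y + c->height)
			return true;
	}
	return false;
}
```

Real hardware would normally be programmed with the rectangles rather than tested per pixel; the loop only shows what "never draws in a clipped area" means.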
+ + The driver is required to ensure it always draws in the area requested or a smaller area, and that it never draws in any of the areas that are clipped. + This may well mean it has to leave alone small areas the application wished to be + drawn. + + + Our example card uses chromakey so does not have to address most of the + clipping. We will add a video_window structure to our global variables to + remember our parameters, as we did with the frame buffer. + + + + + case VIDIOCGWIN: + { + if(copy_to_user(arg, &capture_win, + sizeof(capture_win))) + return -EFAULT; + return 0; + } + + + case VIDIOCSWIN: + { + struct video_window v; + if(copy_from_user(&v, arg, sizeof(v))) + return -EFAULT; + if(v.width > 640 || v.height > 480) + return -EINVAL; + if(v.width < 16 || v.height < 16) + return -EINVAL; + hardware_set_key(v.chromakey); + hardware_set_window(v); + memcpy(&capture_win, &v, sizeof(v)); + capture_w = v.width; + capture_h = v.height; + return 0; + } + + + + + Because we are using chromakey our setup is fairly simple. Mostly we have to + check the values are sane and load them into the capture card. + + + With all the setup done we can now turn on the actual capture/overlay. This + is done with the VIDIOCCAPTURE ioctl. This takes a single integer argument + where 0 is off and any other value is on. + + + + + case VIDIOCCAPTURE: + { + int v; + if(get_user(v, (int *)arg)) + return -EFAULT; + if(v==0) + hardware_capture_off(); + else + { + if(capture_fb.width == 0 + || capture_w == 0) + return -EINVAL; + hardware_capture_on(); + } + return 0; + } + + + + + We grab the flag from user space and either enable or disable according to + its value. There is one small corner case we have to consider here. Suppose + that the capture was requested before the video window or the frame buffer + had been set up. In those cases there will be unconfigured fields in our + card data, as well as unconfigured hardware settings.
We check for this case and + return an error if the frame buffer or the capture window width is zero. + + + + + default: + return -ENOIOCTLCMD; + } +} + + + + We don't need to support any other ioctls, so if we get this far, it is time + to tell the video layer that we don't know what the user is talking about. + +
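Looking back at the VIDIOCCAPTURE corner case, the ordering rule can be modelled outside the kernel. The sketch below is a user-space model, not driver code: struct capture_state and capture_switch() are hypothetical names, and the two width fields stand in for capture_fb.width and capture_w from the listings above.

```c
#include <errno.h>

/* Hypothetical model of the driver state consulted by VIDIOCCAPTURE. */
struct capture_state {
	int fb_width;   /* 0 until the frame buffer has been set up */
	int win_width;  /* 0 until the capture window has been set up */
	int capturing;  /* non-zero while capture is switched on */
};

/* 0 switches capture off; any other value switches it on, but only
 * once both the frame buffer and the window have been configured. */
static inline int capture_switch(struct capture_state *s, int on)
{
	if (!on) {
		s->capturing = 0;
		return 0;
	}
	if (s->fb_width == 0 || s->win_width == 0)
		return -EINVAL;
	s->capturing = 1;
	return 0;
}
```

An application is therefore expected to issue the frame buffer and window set-up calls before it asks for capture to be turned on.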
+ + Other Functionality + + The Video4Linux layer supports additional features, including a high + performance mmap() based capture mode and capturing part of the image. + These features are out of the scope of the book. You should however have enough + example code to implement most simple video4linux devices for radio and TV + cards. + + +
+ + Known Bugs And Assumptions + + + Multiple Opens + + + The driver assumes multiple opens should not be allowed. A driver + can work around this but not cleanly. + + + + API Deficiencies + + + The existing API poorly reflects compression capable devices. There + are plans afoot to merge V4L, V4L2 and some other ideas into a + better interface. + + + + + + + + + Public Functions Provided +!Edrivers/media/video/videodev.c + + +
diff --git a/Documentation/DocBook/wanbook.tmpl b/Documentation/DocBook/wanbook.tmpl new file mode 100644 index 000000000000..9eebcc304de4 --- /dev/null +++ b/Documentation/DocBook/wanbook.tmpl @@ -0,0 +1,99 @@ + + + + + + Synchronous PPP and Cisco HDLC Programming Guide + + + + Alan + Cox + +
+ alan@redhat.com +
+
+
+
+ + + 2000 + Alan Cox + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + +
+ + + + + Introduction + + The syncppp drivers in Linux provide a fairly complete + implementation of Cisco HDLC and a minimal implementation of + PPP. The longer term goal is to switch the PPP layer to the + generic PPP interface that is new in Linux 2.3.x. The API should + remain unchanged when this is done, but support will then be + available for IPX, compression and other PPP features. + + + + Known Bugs And Assumptions + + + PPP is minimal + + + The current PPP implementation is very basic, although sufficient + for most WAN uses. + + + + Cisco HDLC Quirks + + + Currently we do not send all packets with the correct Cisco multicast + or unicast flags. Nothing appears to mind too much but this should + be corrected. + + + + + + + + + Public Functions Provided +!Edrivers/net/wan/syncppp.c + + +
diff --git a/Documentation/DocBook/writing_usb_driver.tmpl b/Documentation/DocBook/writing_usb_driver.tmpl new file mode 100644 index 000000000000..51f3bfb6fb6e --- /dev/null +++ b/Documentation/DocBook/writing_usb_driver.tmpl @@ -0,0 +1,419 @@ + + + + + + Writing USB Device Drivers + + + + Greg + Kroah-Hartman + +
+ greg@kroah.com +
+
+
+
+ + + 2001-2002 + Greg Kroah-Hartman + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + + + This documentation is based on an article published in + Linux Journal Magazine, October 2001, Issue 90. + + +
+ + + + + Introduction + + The Linux USB subsystem has grown from supporting only two different + types of devices in the 2.2.7 kernel (mice and keyboards), to over 20 + different types of devices in the 2.4 kernel. Linux currently supports + almost all USB class devices (standard types of devices like keyboards, + mice, modems, printers and speakers) and an ever-growing number of + vendor-specific devices (such as USB to serial converters, digital + cameras, Ethernet devices and MP3 players). For a full list of the + different USB devices currently supported, see Resources. + + + The remaining kinds of USB devices that do not have support on Linux are + almost all vendor-specific devices. Each vendor decides to implement a + custom protocol to talk to their device, so a custom driver usually needs + to be created. Some vendors are open with their USB protocols and help + with the creation of Linux drivers, while others do not publish them, and + developers are forced to reverse-engineer. See Resources for some links + to handy reverse-engineering tools. + + + Because each different protocol causes a new driver to be created, I have + written a generic USB driver skeleton, modeled after the pci-skeleton.c + file in the kernel source tree upon which many PCI network drivers have + been based. This USB skeleton can be found at drivers/usb/usb-skeleton.c + in the kernel source tree. In this article I will walk through the basics + of the skeleton driver, explaining the different pieces and what needs to + be done to customize it to your specific device. + + + + + Linux USB Basics + + If you are going to write a Linux USB driver, please become familiar with + the USB protocol specification. It can be found, along with many other + useful documents, at the USB home page (see Resources). An excellent + introduction to the Linux USB subsystem can be found at the USB Working + Devices List (see Resources). 
It explains how the Linux USB subsystem is + structured and introduces the reader to the concept of USB urbs, which + are essential to USB drivers. + + + The first thing a Linux USB driver needs to do is register itself with + the Linux USB subsystem, giving it some information about which devices + the driver supports and which functions to call when a device supported + by the driver is inserted or removed from the system. All of this + information is passed to the USB subsystem in the usb_driver structure. + The skeleton driver declares a usb_driver as: + + +static struct usb_driver skel_driver = { + .name = "skeleton", + .probe = skel_probe, + .disconnect = skel_disconnect, + .fops = &skel_fops, + .minor = USB_SKEL_MINOR_BASE, + .id_table = skel_table, +}; + + + The variable name is a string that describes the driver. It is used in + informational messages printed to the system log. The probe and + disconnect function pointers are called when a device that matches the + information provided in the id_table variable is either seen or removed. + + + The fops and minor variables are optional. Most USB drivers hook into + another kernel subsystem, such as the SCSI, network or TTY subsystem. + These types of drivers register themselves with the other kernel + subsystem, and any user-space interactions are provided through that + interface. But for drivers that do not have a matching kernel subsystem, + such as MP3 players or scanners, a method of interacting with user space + is needed. The USB subsystem provides a way to register a minor device + number and a set of file_operations function pointers that enable this + user-space interaction. The skeleton driver needs this kind of interface, + so it provides a minor starting number and a pointer to its + file_operations functions. 
+ + + The USB driver is then registered with a call to usb_register, usually in + the driver's init function, as shown here: + + +static int __init usb_skel_init(void) +{ + int result; + + /* register this driver with the USB subsystem */ + result = usb_register(&skel_driver); + if (result < 0) { + err("usb_register failed for the " __FILE__ " driver. " + "Error number %d", result); + return result; + } + + return 0; +} +module_init(usb_skel_init); + + + When the driver is unloaded from the system, it needs to unregister + itself with the USB subsystem. This is done with the usb_deregister + function: + + +static void __exit usb_skel_exit(void) +{ + /* deregister this driver with the USB subsystem */ + usb_deregister(&skel_driver); +} +module_exit(usb_skel_exit); + + + To enable the linux-hotplug system to load the driver automatically when + the device is plugged in, you need to create a MODULE_DEVICE_TABLE. The + following code tells the hotplug scripts that this module supports a + single device with a specific vendor and product ID: + + +/* table of devices that work with this driver */ +static struct usb_device_id skel_table [] = { + { USB_DEVICE(USB_SKEL_VENDOR_ID, USB_SKEL_PRODUCT_ID) }, + { } /* Terminating entry */ +}; +MODULE_DEVICE_TABLE (usb, skel_table); + + + There are other macros that can be used in describing a usb_device_id for + drivers that support a whole class of USB devices. See usb.h for more + information on this. + + + + + Device operation + + When a device is plugged into the USB bus that matches the device ID + pattern that your driver registered with the USB core, the probe function + is called. The usb_interface structure and the matching usb_device_id entry + are passed to the function: + + +static int skel_probe(struct usb_interface *interface, + const struct usb_device_id *id) + + + The driver now needs to verify that this device is actually one that it + can accept. If so, it returns 0.
+ If not, or if any error occurs during initialization, an error code + (such as -ENOMEM or -ENODEV) + is returned from the probe function. + + + In the skeleton driver, we determine what endpoints are marked as bulk-in + and bulk-out. We create buffers to hold the data that will be sent to and + received from the device, and a USB urb to write data to the device is + initialized. + + + Conversely, when the device is removed from the USB bus, the disconnect + function is called with the device pointer. The driver needs to clean up any + private data that has been allocated at this time and to shut down any + pending urbs that are in the USB system. The driver also unregisters + itself from the devfs subsystem with the call: + + +/* remove our devfs node */ +devfs_unregister(skel->devfs); + + + Now that the device is plugged into the system and the driver is bound to + the device, any of the functions in the file_operations structure that + were passed to the USB subsystem will be called from a user program trying + to talk to the device. The first function called will be open, as the + program tries to open the device for I/O. We increment our private usage + count and save off a pointer to our internal structure in the file + structure. This is done so that future calls to file operations will + enable the driver to determine which device the user is addressing. All + of this is done with the following code: + + +/* increment our usage count for the module */ +++skel->open_count; + +/* save our object in the file's private structure */ +file->private_data = skel; + + + After the open function is called, the read and write functions are called + to receive and send data to the device. In the skel_write function, we + receive a pointer to some data that the user wants to send to the device + and the size of the data.
The function determines how much data it can + send to the device based on the size of the write urb it has created (this + size depends on the size of the bulk out endpoint that the device has). + Then it copies the data from user space to kernel space, points the urb to + the data and submits the urb to the USB subsystem. This can be shown in + the following code: + + +/* we can only write as much as 1 urb will hold */ +bytes_written = (count > skel->bulk_out_size) ? skel->bulk_out_size : count; + +/* copy the data from user space into our urb */ +if (copy_from_user(skel->write_urb->transfer_buffer, buffer, bytes_written)) + return -EFAULT; + +/* set up our urb */ +usb_fill_bulk_urb(skel->write_urb, + skel->dev, + usb_sndbulkpipe(skel->dev, skel->bulk_out_endpointAddr), + skel->write_urb->transfer_buffer, + bytes_written, + skel_write_bulk_callback, + skel); + +/* send the data out the bulk port */ +result = usb_submit_urb(skel->write_urb, GFP_KERNEL); +if (result) { + err("Failed submitting write urb, error %d", result); +} + + + When the write urb is filled up with the proper information using the + usb_fill_bulk_urb function, we point the urb's completion callback to call our + own skel_write_bulk_callback function. This function is called when the + urb is finished by the USB subsystem. The callback function is called in + interrupt context, so caution must be taken not to do very much processing + at that time. Our implementation of skel_write_bulk_callback merely + reports if the urb was completed successfully or not and then returns. + + + The read function works a bit differently from the write function in that + we do not use an urb to transfer data from the device to the driver. + Instead we call the usb_bulk_msg function, which can be used to send or + receive data from a device without having to create urbs and handle + urb completion callback functions. We call the usb_bulk_msg function, + giving it a buffer into which to place any data received from the device + and a timeout value.
If the timeout period expires without receiving any + data from the device, the function will fail and return an error code. + This can be shown with the following code: + + +/* do an immediate bulk read to get data from the device */ +retval = usb_bulk_msg (skel->dev, + usb_rcvbulkpipe (skel->dev, + skel->bulk_in_endpointAddr), + skel->bulk_in_buffer, + skel->bulk_in_size, + &count, HZ*10); +/* if the read was successful, copy the data to user space */ +if (!retval) { + if (copy_to_user (buffer, skel->bulk_in_buffer, count)) + retval = -EFAULT; + else + retval = count; +} + + + The usb_bulk_msg function can be very useful for doing single reads or + writes to a device; however, if you need to read or write constantly to a + device, it is recommended to set up your own urbs and submit them to the + USB subsystem. + + + When the user program releases the file handle that it has been using to + talk to the device, the release function in the driver is called. In this + function we decrement our private usage count and wait for possible + pending writes: + + +/* decrement our usage count for the device */ +--skel->open_count; + + + One of the more difficult problems that USB drivers must be able to handle + smoothly is the fact that the USB device may be removed from the system at + any point in time, even if a program is currently talking to it. The driver + needs to be able to shut down any current reads and writes and notify the + user-space programs that the device is no longer there.
The following + code (function skel_delete) + is an example of how to do this: + +static inline void skel_delete (struct usb_skel *dev) +{ + if (dev->bulk_in_buffer != NULL) + kfree (dev->bulk_in_buffer); + if (dev->bulk_out_buffer != NULL) + usb_buffer_free (dev->udev, dev->bulk_out_size, + dev->bulk_out_buffer, + dev->write_urb->transfer_dma); + if (dev->write_urb != NULL) + usb_free_urb (dev->write_urb); + kfree (dev); +} + + + If a program currently has an open handle to the device, we clear the flag + device_present. For + every read, write, release and other functions that expect a device to be + present, the driver first checks this flag to see if the device is + still present. If not, it realizes that the device has disappeared, and a + -ENODEV error is returned to the user-space program. When the release + function is eventually called, it checks whether the device is still + present and, if it is not, it does the cleanup that the skel_disconnect + function normally does if there are no open files on the device (see + Listing 5). + + + + + Isochronous Data + + This usb-skeleton driver does not have any examples of interrupt or + isochronous data being sent to or from the device. Interrupt data is sent + almost exactly as bulk data is, with a few minor exceptions. Isochronous + data works differently with continuous streams of data being sent to or + from the device. The audio and video camera drivers are very good examples + of drivers that handle isochronous data and will be useful if you also + need to do this. + + + + + Conclusion + + Writing Linux USB device drivers is not a difficult task as the + usb-skeleton driver shows. This driver, combined with the other current + USB drivers, should provide enough examples to help a beginning author + create a working driver in a minimal amount of time. The linux-usb-devel + mailing list archives also contain a lot of helpful information.
+ + + + + Resources + + The Linux USB Project: http://www.linux-usb.org/ + + + Linux Hotplug Project: http://linux-hotplug.sourceforge.net/ + + + Linux USB Working Devices List: http://www.qbik.ch/usb/devices/ + + + linux-usb-devel Mailing List Archives: http://marc.theaimsgroup.com/?l=linux-usb-devel + + + Programming Guide for Linux USB Device Drivers: http://usb.cs.tum.edu/usbdoc + + + USB Home Page: http://www.usb.org + + + +
diff --git a/Documentation/DocBook/z8530book.tmpl b/Documentation/DocBook/z8530book.tmpl new file mode 100644 index 000000000000..a507876447aa --- /dev/null +++ b/Documentation/DocBook/z8530book.tmpl @@ -0,0 +1,385 @@ + + + + + + Z8530 Programming Guide + + + + Alan + Cox + +
+ alan@redhat.com +
+
+
+
+ + + 2000 + Alan Cox + + + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + +
+ + + + + Introduction + + The Z85x30 family synchronous/asynchronous controller chips are + used on a large number of cheap network interface cards. The + kernel provides a core interface layer that is designed to make + it easy to provide WAN services using this chip. + + + The current driver only supports synchronous operation. Merging the + asynchronous driver support into this code to allow any Z85x30 + device to be used as both a tty interface and as a synchronous + controller is a project for Linux after the 2.4 release. + + + The support code handles most common card configurations and + supports running both Cisco HDLC and Synchronous PPP. With extra + glue the frame relay and X.25 protocols can also be used with this + driver. + + + + + Driver Modes + + The Z85230 driver layer can drive Z8530, Z85C30 and Z85230 devices + in three different modes. Each mode can be applied to an individual + channel on the chip (each chip has two channels). + + + The PIO synchronous mode supports the most common Z8530 wiring. Here + the chip is interfaced to the I/O and interrupt facilities of the + host machine but not to the DMA subsystem. When running PIO the + Z8530 has extremely tight timing requirements. Doing high speeds, + even with a Z85230, will be tricky. Typically you should expect to + achieve at best 9600 baud with a Z85C30 and 64Kbit/s with a Z85230. + + + The DMA mode supports the chip when it is configured to use dual DMA + channels on an ISA bus. The better cards tend to support this mode + of operation for a single channel. With DMA running the Z85230 tops + out when it starts to hit ISA DMA constraints at about 512Kbit/s. It + is worth noting here that many PC machines hang or crash when the + chip is driven fast enough to hold the ISA bus solid. + + + Transmit DMA mode uses a single DMA channel. The DMA channel is used + for transmission as the transmit FIFO is smaller than the receive + FIFO.
It gives better performance than pure PIO mode but is nowhere + near as good as pure DMA mode. + + + + + Using the Z85230 driver + + The Z85230 driver provides the back end interface to your board. To + configure a Z8530 interface you need to detect the board and to + identify its ports and interrupt resources. It is also your problem + to verify the resources are available. + + + Having identified the chip you need to fill in a struct z8530_dev, + which describes each chip. This object must exist until you finally + shut down the board. Firstly zero the active field. This ensures + nothing goes off without you intending it. The irq field should + be set to the interrupt number of the chip. (Each chip has a single + interrupt source rather than one per channel). You are responsible + for allocating the interrupt line. The interrupt handler should be + set to z8530_interrupt. The device id should + be set to the z8530_dev structure pointer. Whether the interrupt can + be shared or not is board dependent, and up to you to initialise. + + + The structure holds two channel structures. + Initialise chanA.ctrlio and chanA.dataio with the address of the + control and data ports. You can OR this with Z8530_PORT_SLEEP to + indicate your interface needs the 5uS delay for chip settling done + in software. The PORT_SLEEP option is architecture specific. Other + flags may become available on future platforms, e.g. for MMIO. + Initialise the chanA.irqs to &z8530_nop to start the chip up + as disabled and discarding interrupt events. This ensures that + stray interrupts will be mopped up and not hang the bus. Set + chanA.dev to point to the device structure itself. The + private and name fields you may use as you wish. The private field + is unused by the Z85230 layer. The name is used for error reporting + and it may thus make sense to make it match the network name. + + + Repeat the same operation with the B channel if your chip has + both channels wired to something useful.
This isn't always the + case. If it is not wired then the I/O values do not matter, but + you must initialise chanB.dev. + + + If your board has DMA facilities then initialise the txdma and + rxdma fields for the relevant channels. You must also allocate the + ISA DMA channels and do any necessary board level initialisation + to configure them. The low level driver will do the Z8530 and + DMA controller programming but not board specific magic. + + + Having initialised the device you can then call + z8530_init. This will probe the chip and + reset it into a known state. An identification sequence is then + run to identify the chip type. If the checks fail to pass the + function returns a non zero error code. Typically this indicates + that the port given is not valid. After this call the + type field of the z8530_dev structure is initialised to either + Z8530, Z85C30 or Z85230 according to the chip found. + + + Once you have called z8530_init you can also make use of the utility + function z8530_describe. This provides a + consistent reporting format for the Z8530 devices, and allows all + the drivers to provide consistent reporting. + + + + + Attaching Network Interfaces + + If you wish to use the network interface facilities of the driver, + then you need to attach a network device to each channel that is + present and in use. In addition, to use the SyncPPP and Cisco HDLC + layers you need to follow some additional plumbing rules. They may seem + complex but a look at the example hostess_sv11 driver should + reassure you. + + + The network device used for each channel should be pointed to by + the netdevice field of each channel. The dev->priv field of the + network device points to your private data - you will need to be + able to find your ppp device from this. In addition, to use the + sync ppp layer the private data must start with a void * pointer + to the syncppp structures.
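The layout rule for the private data can be sketched as follows. This is only an illustration that compiles on its own: struct z8530_dev_stub and struct ppp_device_stub are placeholders for the real kernel structures, and struct my_card_priv, its field names and my_card_priv_init() are hypothetical names chosen for the example.

```c
#include <stddef.h>

/* Placeholder types: stand-ins for the kernel's struct z8530_dev and
 * struct ppp_device, reduced so that the sketch compiles on its own. */
struct z8530_dev_stub { int active; };
struct ppp_device_stub { int unused; };

/* Hypothetical per-card private data. */
struct my_card_priv {
	void *if_ptr;                   /* must stay first: read by the sync ppp layer */
	struct z8530_dev_stub sync;     /* the Z8530 device definition */
	struct ppp_device_stub pppdev;  /* the ppp device, kept alongside */
};

/* Point the leading void * at the embedded ppp device. */
static inline void my_card_priv_init(struct my_card_priv *priv)
{
	priv->if_ptr = &priv->pppdev;
}
```

Because the pointer is the first member, code that only knows the address of the private data can still find the ppp device by reading a void * from that address.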
+ + + The way most drivers approach this particular problem is to + create a structure holding the Z8530 device definition and + put that and the syncppp pointer into the private field of + the network device. The network device fields of the channels + then point back to the network devices. The ppp_device can also + be put in the private structure conveniently. + + + If you wish to use the synchronous ppp then you need to attach + the syncppp layer to the network device. You should do this before + you register the network device. The + sppp_attach requires that the first void * + pointer in your private data is pointing to an empty struct + ppp_device. The function fills in the initial data for the + ppp/hdlc layer. + + + Before you register your network device you will also need to + provide suitable handlers for most of the network device callbacks. + See the network device documentation for more details on this. + + + + + Configuring And Activating The Port + + The Z85230 driver provides helper functions and tables to load the + port registers on the Z8530 chips. When programming the register + settings for a channel be aware that the documentation recommends + initialisation orders. Strange things happen when these are not + followed. + + + z8530_channel_load takes an array of + pairs of initialisation values in an array of u8 type. The first + value is the Z8530 register number. Add 16 to indicate the alternate + register bank on the later chips. The array is terminated by a 255. + + + The driver provides a pair of public tables. The + z8530_hdlc_kilostream table is for the UK 'Kilostream' service and + also happens to cover most other end host configurations. The + z8530_hdlc_kilostream_85230 table is the same configuration using + the enhancements of the 85230 chip. The configuration loaded is + standard NRZ encoded synchronous data with HDLC bitstuffing. All + of the timing is taken from the other end of the link. 
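The table format taken by z8530_channel_load (register number, value, terminated by 255, with 16 added to select the alternate bank) can be modelled in a few lines. The sketch below is a user-space model and not the driver's implementation: struct chan_model and model_channel_load() are hypothetical, and the 32-entry array simply stands in for the two register banks.

```c
#include <stdint.h>

#define REG_TABLE_END 255  /* terminator value described in the text */

/* Hypothetical model of one channel: entries 0-15 are the base
 * registers, entries 16-31 the alternate bank on the later chips. */
struct chan_model {
	uint8_t regs[32];
};

/* Walk a table of (register, value) pairs until the 255 terminator.
 * Returns the number of pairs stored; out of range register numbers
 * are skipped. */
static inline int model_channel_load(struct chan_model *c, const uint8_t *table)
{
	int loaded = 0;

	while (table[0] != REG_TABLE_END) {
		uint8_t reg = table[0];

		if (reg < 32) {
			c->regs[reg] = table[1];
			loaded++;
		}
		table += 2;
	}
	return loaded;
}
```

A table for this model might read `{ 1, 0x10, 4, 0x20, 16 + 7, 0x5A, 255 }`; the values are arbitrary and are not meaningful Z8530 settings.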
+ + + When writing your own tables be aware that the driver internally + tracks register values. It may need to reload values. You should + therefore be sure to set registers 1-7, 9-11, 14 and 15 in all + configurations. Where the register settings depend on DMA selection + the driver will update the bits itself when you open or close. + Loading a new table with the interface open is not recommended. + + + There are three standard configurations supported by the core + code. In PIO mode the interface is programmed up to use + interrupt driven PIO. This places high demands on the host processor + to avoid latency. The driver is written to take account of latency + issues but it cannot avoid latencies caused by other drivers, + notably IDE in PIO mode. Because the drivers allocate buffers you + must also prevent MTU changes while the port is open. + + + Once the port is open it will call the rx_function of each channel + whenever a completed packet has arrived. This is invoked from + interrupt context and passes you the channel and a network + buffer (struct sk_buff) holding the data. The data includes + the CRC bytes so most users will want to trim the last two + bytes before processing the data. This function is very timing + critical. When you wish to simply discard data the support + code provides the function z8530_null_rx + to discard the data. + + + To activate PIO mode sending and receiving, + z8530_sync_open is called. This expects to be passed + the network device and the channel. Typically this is called from + your network device open callback. On a failure a non zero error + status is returned. The z8530_sync_close + function shuts down a PIO channel. This must be done before the + channel is opened again and before the driver shuts down + and unloads. + + + The ideal mode of operation is dual channel DMA mode. Here the + kernel driver will configure the board for DMA in both directions.
+ The driver also handles ISA DMA issues such as controller + programming and the memory range limit for you. This mode is + activated by calling the z8530_sync_dma_open + function. On failure a non zero error value is returned. + Once this mode is activated it can be shut down by calling the + z8530_sync_dma_close. You must call the close + function matching the open mode you used. + + + The final supported mode uses a single DMA channel to drive the + transmit side. As the Z85C30 has a larger FIFO on the receive + channel this tends to increase the maximum speed a little. + This is activated by calling the z8530_sync_txdma_open + . This returns a non zero error code on failure. The + z8530_sync_txdma_close function closes down + the Z8530 interface from this mode. + + + + + Network Layer Functions + + The Z8530 layer provides functions to queue packets for + transmission. The driver internally buffers the frame currently + being transmitted and one further frame (in order to keep back + to back transmission running). Any further buffering is up to + the caller. + + + The function z8530_queue_xmit takes a network + buffer in sk_buff format and queues it for transmission. The + caller must provide the entire packet with the exception of the + bitstuffing and CRC. This is normally done by the caller via + the syncppp interface layer. It returns 0 if the buffer has been + queued and non zero values for queue full. If the function accepts + the buffer it becomes property of the Z8530 layer and the caller + should not free it. + + + The function z8530_get_stats returns a pointer + to an internally maintained per interface statistics block. This + provides most of the interface code needed to implement the network + layer get_stats callback. + + + + + Porting The Z8530 Driver + + The Z8530 driver is written to be portable. In DMA mode it makes + assumptions about the use of ISA DMA. 
These are probably warranted + in most cases as the Z85230 in particular was designed to glue to PC + type machines. The PIO mode makes no real assumptions. + + + Should you need to retarget the Z8530 driver to another architecture + the only code that should need changing are the port I/O functions. + At the moment these assume PC I/O port accesses. This may not be + appropriate for all platforms. Replacing + z8530_read_port and z8530_write_port + is intended to be all that is required to port this + driver layer. + + + + + Known Bugs And Assumptions + + + Interrupt Locking + + + The locking in the driver is done via the global cli/sti lock. This + makes for relatively poor SMP performance. Switching this to use a + per device spin lock would probably materially improve performance. + + + + Occasional Failures + + + We have reports of occasional failures when run for very long + periods of time and the driver starts to receive junk frames. At + the moment the cause of this is not clear. + + + + + + + + + Public Functions Provided +!Edrivers/net/wan/z85230.c + + + + Internal Functions +!Idrivers/net/wan/z85230.c + + +
diff --git a/Documentation/IO-mapping.txt b/Documentation/IO-mapping.txt new file mode 100644 index 000000000000..86edb61bdee6 --- /dev/null +++ b/Documentation/IO-mapping.txt @@ -0,0 +1,208 @@ +[ NOTE: The virt_to_bus() and bus_to_virt() functions have been + superseded by the functionality provided by the PCI DMA + interface (see Documentation/DMA-mapping.txt). They continue + to be documented below for historical purposes, but new code + must not use them. --davidm 00/12/12 ] + +[ This is a mail message in response to a query on IO mapping, thus the + strange format for a "document" ] + +The AHA-1542 is a bus-master device, and your patch makes the driver give the +controller the physical address of the buffers, which is correct on x86 +(because all bus master devices see the physical memory mappings directly). + +However, on many setups, there are actually _three_ different ways of looking +at memory addresses, and in this case we actually want the third, the +so-called "bus address". + +Essentially, the three ways of addressing memory are (this is "real memory", +that is, normal RAM--see later about other details): + + - CPU untranslated. This is the "physical" address. Physical address + 0 is what the CPU sees when it drives zeroes on the memory bus. + + - CPU translated address. This is the "virtual" address, and is + completely internal to the CPU itself with the CPU doing the appropriate + translations into "CPU untranslated". + + - bus address. This is the address of memory as seen by OTHER devices, + not the CPU. Now, in theory there could be many different bus + addresses, with each device seeing memory in some device-specific way, but + happily most hardware designers aren't actually actively trying to make + things any more complex than necessary, so you can assume that all + external hardware sees the memory the same way. + +Now, on normal PCs the bus address is exactly the same as the physical +address, and things are very simple indeed. 
However, they are that simple +because the memory and the devices share the same address space, and that is +not generally necessarily true on other PCI/ISA setups. + +Now, just as an example, on the PReP (PowerPC Reference Platform), the +CPU sees a memory map something like this (this is from memory): + + 0-2 GB "real memory" + 2 GB-3 GB "system IO" (inb/out and similar accesses on x86) + 3 GB-4 GB "IO memory" (shared memory over the IO bus) + +Now, that looks simple enough. However, when you look at the same thing from +the viewpoint of the devices, you have the reverse, and the physical memory +address 0 actually shows up as address 2 GB for any IO master. + +So when the CPU wants any bus master to write to physical memory 0, it +has to give the master address 0x80000000 as the memory address. + +So, for example, depending on how the kernel is actually mapped on the +PPC, you can end up with a setup like this: + + physical address: 0 + virtual address: 0xC0000000 + bus address: 0x80000000 + +where all the addresses actually point to the same thing. It's just seen +through different translations.. + +Similarly, on the Alpha, the normal translation is + + physical address: 0 + virtual address: 0xfffffc0000000000 + bus address: 0x40000000 + +(but there are also Alphas where the physical address and the bus address +are the same). + +Anyway, the way to look up all these translations, you do + + #include + + phys_addr = virt_to_phys(virt_addr); + virt_addr = phys_to_virt(phys_addr); + bus_addr = virt_to_bus(virt_addr); + virt_addr = bus_to_virt(bus_addr); + +Now, when do you need these? + +You want the _virtual_ address when you are actually going to access that +pointer from the kernel. So you can have something like this: + + /* + * this is the hardware "mailbox" we use to communicate with + * the controller. The controller sees this directly. + */ + struct mailbox { + __u32 status; + __u32 bufstart; + __u32 buflen; + .. 
+ } mbox; + + unsigned char * retbuffer; + + /* get the address from the controller */ + retbuffer = bus_to_virt(mbox.bufstart); + switch (retbuffer[0]) { + case STATUS_OK: + ... + +on the other hand, you want the bus address when you have a buffer that +you want to give to the controller: + + /* ask the controller to read the sense status into "sense_buffer" */ + mbox.bufstart = virt_to_bus(&sense_buffer); + mbox.buflen = sizeof(sense_buffer); + mbox.status = 0; + notify_controller(&mbox); + +And you generally _never_ want to use the physical address, because you can't +use that from the CPU (the CPU only uses translated virtual addresses), and +you can't use it from the bus master. + +So why do we care about the physical address at all? We do need the physical +address in some cases, it's just not very often in normal code. The physical +address is needed if you use memory mappings, for example, because the +"remap_pfn_range()" mm function wants the physical address of the memory to +be remapped as measured in units of pages, a.k.a. the pfn (the memory +management layer doesn't know about devices outside the CPU, so it +shouldn't need to know about "bus addresses" etc). + +NOTE NOTE NOTE! The above is only one part of the whole equation. The above +only talks about "real memory", that is, CPU memory (RAM). + +There is a completely different type of memory too, and that's the "shared +memory" on the PCI or ISA bus. That's generally not RAM (although in the case +of a video graphics card it can be normal DRAM that is just used for a frame +buffer), but can be things like a packet buffer in a network card etc. + +This memory is called "PCI memory" or "shared memory" or "IO memory" or +whatever, and there is only one way to access it: the readb/writeb and +related functions. 
You should never take the address of such memory, because +there is really nothing you can do with such an address: it's not +conceptually in the same memory space as "real memory" at all, so you cannot +just dereference a pointer. (Sadly, on x86 it _is_ in the same memory space, +so on x86 it actually works to just dereference a pointer, but it's not +portable). + +For such memory, you can do things like + + - reading: + /* + * read first 32 bits from ISA memory at 0xC0000, aka + * C000:0000 in DOS terms + */ + unsigned int signature = isa_readl(0xC0000); + + - remapping and writing: + /* + * remap framebuffer PCI memory area at 0xFC000000, + * size 1MB, so that we can access it: We can directly + * access only the 640k-1MB area, so anything else + * has to be remapped. + */ + char * baseptr = ioremap(0xFC000000, 1024*1024); + + /* write a 'A' to the offset 10 of the area */ + writeb('A',baseptr+10); + + /* unmap when we unload the driver */ + iounmap(baseptr); + + - copying and clearing: + /* get the 6-byte Ethernet address at ISA address E000:0040 */ + memcpy_fromio(kernel_buffer, 0xE0040, 6); + /* write a packet to the driver */ + memcpy_toio(0xE1000, skb->data, skb->len); + /* clear the frame buffer */ + memset_io(0xA0000, 0, 0x10000); + +OK, that just about covers the basics of accessing IO portably. Questions? +Comments? You may think that all the above is overly complex, but one day you +might find yourself with a 500 MHz Alpha in front of you, and then you'll be +happy that your driver works ;) + +Note that kernel versions 2.0.x (and earlier) mistakenly called the +ioremap() function "vremap()". ioremap() is the proper name, but I +didn't think straight when I wrote it originally.
People who have to +support both can do something like: + + /* support old naming silliness */ + #if LINUX_VERSION_CODE < 0x020100 + #define ioremap vremap + #define iounmap vfree + #endif + +at the top of their source files, and then they can use the right names +even on 2.0.x systems. + +And the above sounds worse than it really is. Most real drivers really +don't do all that complex things (or rather: the complexity is not so +much in the actual IO accesses as in error handling and timeouts etc). +It's generally not hard to fix drivers, and in many cases the code +actually looks better afterwards: + + unsigned long signature = *(unsigned int *) 0xC0000; + vs + unsigned long signature = readl(0xC0000); + +I think the second version actually is more readable, no? + + Linus + diff --git a/Documentation/IPMI.txt b/Documentation/IPMI.txt new file mode 100644 index 000000000000..90d10e708ca3 --- /dev/null +++ b/Documentation/IPMI.txt @@ -0,0 +1,534 @@ + + The Linux IPMI Driver + --------------------- + Corey Minyard + + + +The Intelligent Platform Management Interface, or IPMI, is a +standard for controlling intelligent devices that monitor a system. +It provides for dynamic discovery of sensors in the system and the +ability to monitor the sensors and be informed when the sensors' +values change or go outside certain boundaries. It also has a +standardized database for field-replaceable units (FRUs) and a watchdog +timer. + +To use this, you need an interface to an IPMI controller in your +system (called a Baseboard Management Controller, or BMC) and +management software that can use the IPMI system. + +This document describes how to use the IPMI driver for Linux. If you +are not familiar with IPMI itself, see the web site at +http://www.intel.com/design/servers/ipmi/index.htm. IPMI is a big +subject and I can't cover it all here!
+ +Configuration +------------- + +The Linux IPMI driver is modular, which means you have to pick several +things to have it work right depending on your hardware. Most of +these are available in the 'Character Devices' menu. + +No matter what, you must pick 'IPMI top-level message handler' to use +IPMI. What you do beyond that depends on your needs and hardware. + +The message handler does not provide any user-level interfaces. +Kernel code (like the watchdog) can still use it. If you need access +from userland through a device driver, you need to select 'Device +interface for IPMI'. Another interface is also +available: you may select 'IPMI sockets' in the 'Networking Support' +main menu. This provides a socket interface to IPMI. You may select +both of these at the same time; they will both work together. + +The driver interface depends on your hardware. If you have a board +with a standard interface (these will generally be either "KCS", +"SMIC", or "BT"; consult your hardware manual), choose the 'IPMI SI +handler' option. A driver also exists for direct I2C access to the +IPMI management controller. Some boards support this, but it is +unknown if it will work on every board. For this, choose 'IPMI SMBus +handler', but be ready to try to do some figuring to see if it will +work. + +There is also a KCS-only driver interface supplied, but it is +deprecated in favor of the SI interface. + +You should generally enable ACPI on your system, as systems with IPMI +should have ACPI tables describing them. + +If you have a standard interface and the board manufacturer has done +their job correctly, the IPMI controller should be automatically +detected (via ACPI or SMBIOS tables) and should just work. Sadly, many +boards do not have this information. The driver attempts standard +defaults, but they may not work. If you fall into this situation, you +need to read the section below named 'The SI Driver' on how to +hand-configure your system.
+ +IPMI defines a standard watchdog timer. You can enable this with the +'IPMI Watchdog Timer' config option. If you compile the driver into +the kernel, then via a kernel command-line option you can have the +watchdog timer start as soon as it initializes. It also has a lot +of other options; see the 'Watchdog' section below for more details. +Note that you can also have the watchdog continue to run if it is +closed (by default it is disabled on close). Go into the 'Watchdog +Cards' menu, enable 'Watchdog Timer Support', and enable the option +'Disable watchdog shutdown on close'. + + +Basic Design +------------ + +The Linux IPMI driver is designed to be very modular and flexible: you +only need to take the pieces you need and you can use it in many +different ways. Because of that, it's broken into many chunks of +code. These chunks are: + +ipmi_msghandler - This is the central piece of software for the IPMI +system. It handles all messages, message timing, and responses. The +IPMI users tie into this, and the IPMI physical interfaces (called +System Management Interfaces, or SMIs) also tie in here. This +provides the kernelland interface for IPMI, but does not provide an +interface for use by application processes. + +ipmi_devintf - This provides a userland IOCTL interface for the IPMI +driver; each open file for this device ties in to the message handler +as an IPMI user. + +ipmi_si - A driver for various system interfaces. This supports +KCS, SMIC, and may support BT in the future. Unless you have your own +custom interface, you probably need to use this. + +ipmi_smb - A driver for accessing BMCs on the SMBus. It uses the +I2C kernel driver's SMBus interfaces to send and receive IPMI messages +over the SMBus. + +af_ipmi - A network socket interface to IPMI. This doesn't take up +a character device in your system. + +Note that the KCS-only interface has been removed. + +Much documentation for the interface is in the include files.
The +IPMI include files are: + +net/af_ipmi.h - Contains the socket interface. + +linux/ipmi.h - Contains the user interface and IOCTL interface for IPMI. + +linux/ipmi_smi.h - Contains the interface for system management interfaces +(things that interface to IPMI controllers) to use. + +linux/ipmi_msgdefs.h - General definitions for base IPMI messaging. + + +Addressing +---------- + +The IPMI addressing works much like IP addresses: you have an overlay +to handle the different address types. The overlay is: + + struct ipmi_addr + { + int addr_type; + short channel; + char data[IPMI_MAX_ADDR_SIZE]; + }; + +The addr_type determines what the address really is. The driver +currently understands two different types of addresses. + +"System Interface" addresses are defined as: + + struct ipmi_system_interface_addr + { + int addr_type; + short channel; + }; + +and the type is IPMI_SYSTEM_INTERFACE_ADDR_TYPE. This is used for talking +straight to the BMC on the current card. The channel must be +IPMI_BMC_CHANNEL. + +Messages that are destined to go out on the IPMB bus use the +IPMI_IPMB_ADDR_TYPE address type. The format is + + struct ipmi_ipmb_addr + { + int addr_type; + short channel; + unsigned char slave_addr; + unsigned char lun; + }; + +The "channel" here is generally zero, but some devices support more +than one channel; it corresponds to the channel as defined in the IPMI +spec. + + +Messages +-------- + +Messages are defined as: + +struct ipmi_msg +{ + unsigned char netfn; + unsigned char lun; + unsigned char cmd; + unsigned char *data; + int data_len; +}; + +The driver takes care of adding/stripping the header information. The +data portion is just the data to be sent (do NOT put addressing info +here) or the response. Note that the completion code of a response is +the first item in "data"; it is not stripped out because that is how +all the messages are defined in the spec (and thus makes counting the +offsets a little easier :-).
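Because the completion code stays in the first byte of "data", code handling a response usually splits that byte from the rest. The following user-space sketch shows one way to do it. The struct is copied from the definition above; the helper names ipmi_completion_code() and ipmi_payload() are hypothetical and are not part of the driver.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the struct ipmi_msg shown above. */
struct ipmi_msg {
    unsigned char netfn;
    unsigned char lun;
    unsigned char cmd;
    unsigned char *data;
    int data_len;
};

/* Hypothetical helper: return the completion code (0-255), or -1 if
 * the response carries no data at all. */
static int ipmi_completion_code(const struct ipmi_msg *rsp)
{
    if (rsp->data == NULL || rsp->data_len < 1)
        return -1;
    return rsp->data[0];
}

/* Hypothetical helper: point *payload at the bytes that follow the
 * completion code and return how many there are. */
static int ipmi_payload(const struct ipmi_msg *rsp,
                        const unsigned char **payload)
{
    if (rsp->data == NULL || rsp->data_len < 1) {
        *payload = NULL;
        return 0;
    }
    *payload = rsp->data + 1;
    return rsp->data_len - 1;
}
```

With this split, offsets into the payload are one less than the offsets counted from the start of "data" as the spec lays them out.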
+ +When using the IOCTL interface from userland, you must provide a block +of data for "data", fill it, and set data_len to the length of the +block of data, even when receiving messages. Otherwise the driver +will have no place to put the message. + +Messages coming up from the message handler in kernelland will come in +as: + + struct ipmi_recv_msg + { + struct list_head link; + + /* The type of message as defined in the "Receive Types" + defines above. */ + int recv_type; + + ipmi_user_t *user; + struct ipmi_addr addr; + long msgid; + struct ipmi_msg msg; + + /* Call this when done with the message. It will presumably free + the message and do any other necessary cleanup. */ + void (*done)(struct ipmi_recv_msg *msg); + + /* Place-holder for the data, don't make any assumptions about + the size or existence of this, since it may change. */ + unsigned char msg_data[IPMI_MAX_MSG_LENGTH]; + }; + +You should look at the receive type and handle the message +appropriately. + + +The Upper Layer Interface (Message Handler) +------------------------------------------- + +The upper layer of the interface provides the users with a consistent +view of the IPMI interfaces. It allows multiple SMI interfaces to be +addressed (because some boards actually have multiple BMCs on them) +and the user should not have to care what type of SMI is below them. + + +Creating the User + +To use the message handler, you must first create a user using +ipmi_create_user. The interface number specifies which SMI you want +to connect to, and you must supply callback functions to be called +when data comes in. The callback function can run at interrupt level, +so be careful using the callbacks. This also allows you to pass in a +piece of data, the handler_data, that will be passed back to you on +all calls. + +Once you are done, call ipmi_destroy_user() to get rid of the user.
+ +From userland, opening the device automatically creates a user, and +closing the device automatically destroys the user. + + +Messaging + +To send a message from kernel-land, the ipmi_request() call does +pretty much all message handling. Most of the parameters are +self-explanatory. However, it takes a "msgid" parameter. This is NOT +the sequence number of messages. It is simply a long value that is +passed back when the response for the message is returned. You may +use it for anything you like. + +Responses come back in the function pointed to by the ipmi_recv_hndl +field of the "handler" that you passed in to ipmi_create_user(). +Remember again, these may be running at interrupt level. Remember to +look at the receive type, too. + +From userland, you fill out an ipmi_req_t structure and use the +IPMICTL_SEND_COMMAND ioctl. For incoming stuff, you can use select() +or poll() to wait for messages to come in. However, you cannot use +read() to get them; you must call the IPMICTL_RECEIVE_MSG ioctl with the +ipmi_recv_t structure to actually get the message. Remember that you +must supply a pointer to a block of data in the msg.data field, and +you must fill in the msg.data_len field with the size of the data. +This gives the receiver a place to actually put the message. + +If the message cannot fit into the data you provide, you will get an +EMSGSIZE error and the driver will leave the data in the receive +queue. If you want to get it and have it truncate the message, use +the IPMICTL_RECEIVE_MSG_TRUNC ioctl. + +When you send a command (which is defined by the lowest-order bit of +the netfn per the IPMI spec) on the IPMB bus, the driver will +automatically assign the sequence number to the command and save the +command. If the response is not received in the IPMI-specified 5 +seconds, it will generate a response automatically saying the command +timed out. If an unsolicited response comes in (if it was after 5 +seconds, for instance), that response will be ignored.
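The remark that a command is identified by the lowest-order bit of the netfn can be made concrete with a short sketch. In the IPMI spec an even netfn marks a request (command) and the matching response uses the next odd value. The helper names below are hypothetical and are not functions provided by the driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: an even netfn is a request (command), an odd
 * netfn is a response. */
static bool ipmi_netfn_is_command(unsigned char netfn)
{
    return (netfn & 1) == 0;
}

/* Hypothetical helper: the response netfn is the request netfn with
 * the lowest-order bit set. */
static unsigned char ipmi_response_netfn(unsigned char request_netfn)
{
    return request_netfn | 1;
}
```

For example, a request sent with netfn 0x06 is answered with netfn 0x07.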
+ +In kernelland, after you receive a message and are done with it, you +MUST call ipmi_free_recv_msg() on it, or you will leak messages. Note +that you should NEVER mess with the "done" field of a message, that is +required to properly clean up the message. + +Note that when sending, there is an ipmi_request_supply_msgs() call +that lets you supply the smi and receive message. This is useful for +pieces of code that need to work even if the system is out of buffers +(the watchdog timer uses this, for instance). You supply your own +buffer and own free routines. This is not recommended for normal use, +though, since it is tricky to manage your own buffers. + + +Events and Incoming Commands + +The driver takes care of polling for IPMI events and receiving +commands (commands are messages that are not responses, they are +commands that other things on the IPMB bus have sent you). To receive +these, you must register for them, they will not automatically be sent +to you. + +To receive events, you must call ipmi_set_gets_events() and set the +"val" to non-zero. Any events that have been received by the driver +since startup will immediately be delivered to the first user that +registers for events. After that, if multiple users are registered +for events, they will all receive all events that come in. + +For receiving commands, you have to individually register commands you +want to receive. Call ipmi_register_for_cmd() and supply the netfn +and command name for each command you want to receive. Only one user +may be registered for each netfn/cmd, but different users may register +for different commands. + +From userland, equivalent IOCTLs are provided to do these functions. + + +The Lower Layer (SMI) Interface +------------------------------- + +As mentioned before, multiple SMI interfaces may be registered to the +message handler, each of these is assigned an interface number when +they register with the message handler. 
They are generally assigned +in the order they register, although if an SMI unregisters and then +another one registers, all bets are off. + +The ipmi_smi.h defines the interface for management interfaces; see +that for more details. + + +The SI Driver +------------- + +The SI driver allows up to 4 KCS or SMIC interfaces to be configured +in the system. By default, the driver scans the ACPI tables for interfaces, and +if it doesn't find any it will attempt to register one KCS +interface at the spec-specified I/O port 0xca2 without interrupts. +You can change this at module load time (for a module) with: + + modprobe ipmi_si.o type=,.... + ports=,... addrs=,... + irqs=,... trydefaults=[0|1] + regspacings=,,... regsizes=,,... + regshifts=,,... + slave_addrs=,,... + +Each of these except si_trydefaults is a list, the first item for the +first interface, second item for the second interface, etc. + +The si_type may be either "kcs", "smic", or "bt". If you leave it blank, it +defaults to "kcs". + +If you specify si_addrs as non-zero for an interface, the driver will +use the memory address given as the address of the device. This +overrides si_ports. + +If you specify si_ports as non-zero for an interface, the driver will +use the I/O port given as the device address. + +If you specify si_irqs as non-zero for an interface, the driver will +attempt to use the given interrupt for the device. + +si_trydefaults sets whether the standard IPMI interface at 0xca2 and +any interfaces specified by ACPI are tried. By default, the driver +tries them; set this value to zero to turn this off. + +The next three parameters have to do with register layout. The +registers used by the interfaces may not appear at successive +locations and they may not be in 8-bit registers. These parameters +allow the layout of the data in the registers to be more precisely +specified. + +The regspacings parameter gives the number of bytes between successive +register start addresses.
For instance, if the regspacing is set to 4 +and the start address is 0xca2, then the address for the second +register would be 0xca6. This defaults to 1. + +The regsizes parameter gives the size of a register, in bytes. The +data used by IPMI is 8 bits wide, but it may be inside a larger +register. This parameter allows the read and write type to be specified. +It may be 1, 2, 4, or 8. The default is 1. + +Since the register size may be larger than 32 bits, the IPMI data may not +be in the lower 8 bits. The regshifts parameter gives the amount to shift +the data to get to the actual IPMI data. + +The slave_addrs specifies the IPMI address of the local BMC. This is +usually 0x20 and the driver defaults to that, but in case it's not, it +can be specified when the driver starts up. + +When compiled into the kernel, the addresses can be specified on the +kernel command line as: + + ipmi_si.type=,... + ipmi_si.ports=,... ipmi_si.addrs=,... + ipmi_si.irqs=,... ipmi_si.trydefaults=[0|1] + ipmi_si.regspacings=,,... + ipmi_si.regsizes=,,... + ipmi_si.regshifts=,,... + ipmi_si.slave_addrs=,,... + +It works the same as the module parameters of the same names. + +By default, the driver will attempt to detect any device specified by +ACPI, and if none is found, a KCS device at the spec-specified +0xca2. If you want to turn this off, set the "trydefaults" option to +false. + +If you have high-res timers compiled into the kernel, the driver will +use them to provide much better performance. Note that if you do not +have high-res timers enabled in the kernel and you don't have +interrupts enabled, the driver will run VERY slowly. Don't blame me, +these interfaces suck. + + +The SMBus Driver +---------------- + +The SMBus driver allows up to 4 SMBus devices to be configured in the +system. By default, the driver will register any SMBus interfaces it finds +in the I2C address range of 0x20 to 0x4f on any adapter.
You can change this +at module load time (for a module) with: + + modprobe ipmi_smb.o + addr=,[,,[,...]] + dbg=,... + [defaultprobe=0] [dbg_probe=1] + +The addresses are specified in pairs: the first is the adapter ID and the +second is the I2C address on that adapter. + +The debug flags are bit flags for each BMC found; they are: +IPMI messages: 1, driver state: 2, timing: 4, I2C probe: 8 + +Setting smb_defaultprobe to zero disables the default probing of SMBus +interfaces at address range 0x20 to 0x4f. This means that only the +BMCs specified on the smb_addr line will be detected. + +Setting smb_dbg_probe to 1 will enable debugging of the probing and +detection process for BMCs on the SMBusses. + +Discovering the IPMI-compliant BMC on the SMBus can cause devices +on the I2C bus to fail. The SMBus driver writes a "Get Device ID" IPMI +message as a block write to the I2C bus and waits for a response. +This action can be detrimental to some I2C devices. It is highly recommended +that the known I2C address be given to the SMBus driver in the smb_addr +parameter. The default address range will not be used when a smb_addr +parameter is provided. + +When compiled into the kernel, the addresses can be specified on the +kernel command line as: + + ipmi_smb.addr=,[,,[,...]] + ipmi_smb.dbg=,... + ipmi_smb.defaultprobe=0 ipmi_smb.dbg_probe=1 + +These are the same options as on the module command line. + +Note that you might need some I2C changes if CONFIG_IPMI_PANIC_EVENT +is enabled along with this, so the I2C driver knows to run to +completion while sending a panic event. + + +Other Pieces +------------ + +Watchdog +-------- + +A watchdog timer is provided that implements the Linux-standard +watchdog timer interface.
It has several module parameters that can be +used to control it: + + modprobe ipmi_watchdog timeout= pretimeout= action= + preaction= preop= start_now=x + nowayout=x + +The timeout is the number of seconds to the action, and the pretimeout +is the number of seconds before the reset that the pre-timeout panic will +occur (if pretimeout is zero, then pretimeout will not be enabled). Note +that the pretimeout is the time before the final timeout. So if the +timeout is 50 seconds and the pretimeout is 10 seconds, then the pretimeout +will occur in 40 seconds (10 seconds before the timeout). + +The action may be "reset", "power_cycle", or "power_off", and +specifies what to do when the timer times out, and defaults to +"reset". + +The preaction may be "pre_smi" for an indication through the SMI +interface, "pre_int" for an indication through the SMI with an +interrupt, and "pre_nmi" for an NMI on a preaction. This is how +the driver is informed of the pretimeout. + +The preop may be set to "preop_none" for no operation on a pretimeout, +"preop_panic" to set the preoperation to panic, or "preop_give_data" +to provide data to read from the watchdog device when the pretimeout +occurs. A "pre_nmi" setting CANNOT be used with "preop_give_data" +because you can't do data operations from an NMI. + +When preop is set to "preop_give_data", one byte becomes ready to read +on the device when the pretimeout occurs. Select and fasync work on +the device, as well. + +If start_now is set to 1, the watchdog timer will start running as +soon as the driver is loaded. + +If nowayout is set to 1, the watchdog timer will not stop when the +watchdog device is closed. The default value of nowayout is true +if the CONFIG_WATCHDOG_NOWAYOUT option is enabled, or false if not.
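The relationship between timeout and pretimeout described earlier in this section is simple arithmetic, sketched below. The function name pretimeout_fires_at() is hypothetical. Treating a pretimeout that is not smaller than the timeout as "not armed" is an assumption made for the example, not documented driver behaviour.

```c
#include <assert.h>

/* Hypothetical helper: seconds after the timer is started at which the
 * pretimeout fires, or -1 if no pretimeout is armed. A pretimeout of
 * zero disables it; rejecting pretimeout >= timeout is an assumption. */
static int pretimeout_fires_at(int timeout, int pretimeout)
{
    if (pretimeout <= 0 || pretimeout >= timeout)
        return -1;
    return timeout - pretimeout;
}
```

With timeout=50 and pretimeout=10 this gives 40, matching the example in the text.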
+ +When compiled into the kernel, the kernel command line is available +for configuring the watchdog: + + ipmi_watchdog.timeout= ipmi_watchdog.pretimeout= + ipmi_watchdog.action= + ipmi_watchdog.preaction= + ipmi_watchdog.preop= + ipmi_watchdog.start_now=x + ipmi_watchdog.nowayout=x + +The options are the same as the module parameter options. + +The watchdog will panic and start a 120 second reset timeout if it +gets a pre-action. During a panic or a reboot, the watchdog will +start a 120 second timer if it is running to make sure the reboot occurs. + +Note that if you use the NMI preaction for the watchdog, you MUST +NOT use NMI watchdog mode 1. If you use the NMI watchdog, you +must use mode 2. + +Once you open the watchdog timer, you must write a 'V' character to the +device to close it, or the timer will not stop. This is a new semantic +for the driver, but makes it consistent with the rest of the watchdog +drivers in Linux. diff --git a/Documentation/IRQ-affinity.txt b/Documentation/IRQ-affinity.txt new file mode 100644 index 000000000000..938d7dd05490 --- /dev/null +++ b/Documentation/IRQ-affinity.txt @@ -0,0 +1,37 @@ + +SMP IRQ affinity, started by Ingo Molnar + + +/proc/irq/IRQ#/smp_affinity specifies which target CPUs are permitted +for a given IRQ source. It's a bitmask of allowed CPUs. It's not allowed +to turn off all CPUs, and if an IRQ controller does not support IRQ +affinity then the value will not change from the default 0xffffffff. + +Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting +the IRQ to CPU4-7 (this is an 8-CPU SMP box): + +[root@moon 44]# cat smp_affinity +ffffffff +[root@moon 44]# echo 0f > smp_affinity +[root@moon 44]# cat smp_affinity +0000000f +[root@moon 44]# ping -f h +PING hell (195.4.7.3): 56 data bytes +...
+--- hell ping statistics --- +6029 packets transmitted, 6027 packets received, 0% packet loss +round-trip min/avg/max = 0.1/0.1/0.4 ms +[root@moon 44]# cat /proc/interrupts | grep 44: + 44: 0 1785 1785 1783 1783 1 +1 0 IO-APIC-level eth1 +[root@moon 44]# echo f0 > smp_affinity +[root@moon 44]# ping -f h +PING hell (195.4.7.3): 56 data bytes +.. +--- hell ping statistics --- +2779 packets transmitted, 2777 packets received, 0% packet loss +round-trip min/avg/max = 0.1/0.5/585.4 ms +[root@moon 44]# cat /proc/interrupts | grep 44: + 44: 1068 1785 1785 1784 1784 1069 1070 1069 IO-APIC-level eth1 +[root@moon 44]# + diff --git a/Documentation/MSI-HOWTO.txt b/Documentation/MSI-HOWTO.txt new file mode 100644 index 000000000000..d5032eb480aa --- /dev/null +++ b/Documentation/MSI-HOWTO.txt @@ -0,0 +1,503 @@ + The MSI Driver Guide HOWTO + Tom L Nguyen tom.l.nguyen@intel.com + 10/03/2003 + Revised Feb 12, 2004 by Martine Silbermann + email: Martine.Silbermann@hp.com + Revised Jun 25, 2004 by Tom L Nguyen + +1. About this guide + +This guide describes the basics of Message Signaled Interrupts (MSI), +the advantages of using MSI over traditional interrupt mechanisms, +and how to enable your driver to use MSI or MSI-X. Also included is +a Frequently Asked Questions. + +2. Copyright 2003 Intel Corporation + +3. What is MSI/MSI-X? + +Message Signaled Interrupt (MSI), as described in the PCI Local Bus +Specification Revision 2.3 or latest, is an optional feature, and a +required feature for PCI Express devices. MSI enables a device function +to request service by sending an Inbound Memory Write on its PCI bus to +the FSB as a Message Signal Interrupt transaction. Because MSI is +generated in the form of a Memory Write, all transaction conditions, +such as a Retry, Master-Abort, Target-Abort or normal completion, are +supported. 
+ +A PCI device that supports MSI must also support the pin IRQ assertion +interrupt mechanism to provide backward compatibility for systems that +do not support MSI. In systems that support MSI, the bus driver is +responsible for initializing the message address and message data of +the device function's MSI/MSI-X capability structure during initial +device configuration. + +An MSI capable device function indicates MSI support by implementing +the MSI/MSI-X capability structure in its PCI capability list. The +device function may implement both the MSI capability structure and +the MSI-X capability structure; however, the bus driver should not +enable both. + +The MSI capability structure contains the Message Control register, +the Message Address register and the Message Data register. These registers +provide the bus driver control over MSI. The Message Control register +indicates the MSI capability supported by the device. The Message +Address register specifies the target address and the Message Data +register specifies the characteristics of the message. To request +service, the device function writes the content of the Message Data +register to the target address. The device and its software driver +are prohibited from writing to these registers. + +The MSI-X capability structure is an optional extension to MSI. It +uses an independent and separate capability structure. There are +some key advantages to implementing the MSI-X capability structure +over the MSI capability structure as described below. + + - Support a larger maximum number of vectors per function. + + - Provide the ability for system software to configure + each vector with an independent message address and message + data, specified by a table that resides in Memory Space. + + - MSI and MSI-X both support per-vector masking. Per-vector + masking is an optional extension of MSI but a required + feature for MSI-X.
Per-vector masking provides the kernel + the ability to mask/unmask MSI when servicing its software + interrupt service routine handler. If per-vector masking is + not supported, then the device driver should provide the + hardware/software synchronization to ensure that the device + generates MSI when the driver wants it to do so. + +4. Why use MSI? + +MSI simplifies board design by allowing board +designers to remove out-of-band interrupt routing. MSI is another +step towards a legacy-free environment. + +Due to increasing pressure on chipset and processor packages to +reduce pin count, the need for interrupt pins is expected to +diminish over time. Devices, due to pin constraints, may implement +messages to increase performance. + +PCI Express endpoints use INTx emulation (in-band messages) instead +of IRQ pin assertion. Using INTx emulation requires interrupt +sharing among devices connected to the same node (PCI bridge) while +MSI is unique (non-shared) and does not require BIOS configuration +support. As a result, the PCI Express technology requires MSI +support for better interrupt performance. + +Using MSI enables the device functions to support two or more +vectors, which can be configured to target different CPUs to +increase scalability. + +5. Configuring a driver to use MSI/MSI-X + +By default, the kernel will not enable MSI/MSI-X on all devices that +support this capability. The CONFIG_PCI_MSI kernel option +must be selected to enable MSI/MSI-X support. + +5.1 Including MSI/MSI-X support into the kernel + +To allow MSI/MSI-X capable device drivers to selectively enable +MSI/MSI-X (using pci_enable_msi()/pci_enable_msix() as described +below), the VECTOR based scheme needs to be enabled by setting +CONFIG_PCI_MSI during kernel config. + +Since the target of the inbound message is the local APIC, +CONFIG_X86_LOCAL_APIC must be enabled as well as CONFIG_PCI_MSI.
+ +5.2 Configuring for MSI support + +Due to the non-contiguous fashion in vector assignment of the +existing Linux kernel, this version does not support multiple +messages regardless of whether a device function is capable of supporting +more than one vector. Enabling MSI on a device function's MSI +capability structure requires a device driver to call the function +pci_enable_msi() explicitly. + +5.2.1 API pci_enable_msi + +int pci_enable_msi(struct pci_dev *dev) + +With this new API, any existing device driver that wants to have +MSI enabled on its device function must call this API to enable MSI. +A successful call will initialize the MSI capability structure +with ONE vector, regardless of whether a device function is +capable of supporting multiple messages. This vector replaces the +pre-assigned dev->irq with a new MSI vector. To avoid a conflict +between the newly assigned vector and the existing pre-assigned vector, +a device driver must call this API before calling request_irq(). + +5.2.2 API pci_disable_msi + +void pci_disable_msi(struct pci_dev *dev) + +This API should always be used to undo the effect of pci_enable_msi() +when a device driver is unloading. This API restores dev->irq with +the pre-assigned IOAPIC vector and switches a device's interrupt +mode to PCI pin-irq assertion/INTx emulation mode. + +Note that a device driver should always call free_irq() on the MSI vector +it has done request_irq() on before calling this API. Failure to do +so results in a BUG_ON(), and the device will be left with MSI enabled and +will leak its vector. + +5.2.3 MSI mode vs. legacy mode diagram + +The diagram below shows the events that switch the interrupt +mode on the MSI-capable device function between MSI mode and +PIN-IRQ assertion mode. + + ------------ pci_enable_msi ------------------------ + | | <=============== | | + | MSI MODE | | PIN-IRQ ASSERTION MODE | + | | ===============> | | + ------------ pci_disable_msi ------------------------ + + +Figure 1.0 MSI Mode vs.
Legacy Mode + +In Figure 1.0, a device operates by default in legacy mode. Legacy +in this context means PCI pin-irq assertion or PCI-Express INTx +emulation. A successful MSI request (using pci_enable_msi()) switches +a device's interrupt mode to MSI mode. A pre-assigned IOAPIC vector +stored in dev->irq will be saved by the PCI subsystem and a newly +assigned MSI vector will replace dev->irq. + +To return to its default mode, a device driver should always call +pci_disable_msi() to undo the effect of pci_enable_msi(). Note that a +device driver should always call free_irq() on the MSI vector it has done +request_irq() on before calling pci_disable_msi(). Failure to do so +results in a BUG_ON(), and the device will be left with MSI enabled and +will leak its vector. Otherwise, the PCI subsystem restores a device's +dev->irq with a pre-assigned IOAPIC vector and marks the released +MSI vector as unused. + +Once the vector is marked as unused, there is no guarantee that the PCI +subsystem will reserve this MSI vector for a device. Depending on +the availability of current PCI vector resources and the number of +MSI/MSI-X requests from other drivers, this MSI may be re-assigned. + +For the case where the PCI subsystem re-assigned this MSI vector +to another driver, a request to switch back to MSI mode may result +in being assigned a different MSI vector or a failure if no more +vectors are available. + +5.3 Configuring for MSI-X support + +Due to the ability of the system software to configure each vector of +the MSI-X capability structure with an independent message address +and message data, the non-contiguous fashion in vector assignment of +the existing Linux kernel has no impact on supporting multiple +messages on an MSI-X capable device function. Enabling MSI-X on +a device function's MSI-X capability structure requires its device +driver to call the function pci_enable_msix() explicitly.
+ +The function pci_enable_msix(), once invoked, enables either +all or nothing, depending on the current availability of PCI vector +resources. If the PCI vector resources are available for the number +of vectors requested by a device driver, this function will configure +the MSI-X table of the MSI-X capability structure of a device with +the requested messages. For example, a device +may be capable of supporting a maximum of 32 vectors while its +software driver may usually request only 4 vectors. It is recommended +that the device driver call this function once during the +initialization phase of the device driver. + +Unlike the function pci_enable_msi(), the function pci_enable_msix() +does not replace the pre-assigned IOAPIC dev->irq with a new MSI +vector because the PCI subsystem writes the 1:1 vector-to-entry mapping +into the field vector of each element contained in the second argument. +Note that the pre-assigned IO-APIC dev->irq is valid only if the device +operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt by +the device driver to use dev->irq to request interrupt service +may result in unpredictable behavior. + +For each MSI-X vector granted, a device driver is responsible for calling +other functions like request_irq(), enable_irq(), etc. to enable +this vector with its corresponding interrupt service handler. It is +a device driver's choice to assign all vectors with the same +interrupt service handler or each vector with a unique interrupt +service handler. + +5.3.1 Handling MMIO address space of MSI-X Table + +The PCI 3.0 specification has implementation notes stating that the MMIO address +space for a device's MSI-X structure should be isolated so that the +software system can set different pages for controlling accesses to +the MSI-X structure.
The implementation of the MSI patch requires the PCI +subsystem, not a device driver, to maintain full control of the MSI-X +table/MSI-X PBA and MMIO address space of the MSI-X table/MSI-X PBA. +A device driver is prohibited from requesting the MMIO address space +of the MSI-X table/MSI-X PBA. Otherwise, the PCI subsystem will fail +to enable MSI-X on its hardware device when it calls the function +pci_enable_msix(). + +5.3.2 Handling MSI-X allocation + +The number of MSI-X vectors allocated to a function +depends on the number of MSI capable devices and MSI-X capable +devices populated in the system. The policy of allocating MSI-X +vectors to a function is defined as follows: + +# of MSI-X vectors allocated to a function = (x - y)/z where + +x = The number of available PCI vector resources by the time + the device driver calls pci_enable_msix(). The PCI vector + resources are the sum of the number of unassigned vectors + (new) and the number of released vectors when any MSI/MSI-X + device driver switches its hardware device back to a legacy + mode or is hot-removed. The number of unassigned vectors + may exclude some vectors reserved, as defined in parameter + NR_HP_RESERVED_VECTORS, for the case where the system is + capable of supporting hot-add/hot-remove operations. Users + may change the value defined in NR_HP_RESERVED_VECTORS to + meet their specific needs. + +y = The number of MSI capable devices populated in the system. + This policy ensures that each MSI capable device has its + vector reserved to avoid the case where some MSI-X capable + drivers may attempt to claim all available vector resources. + +z = The number of MSI-X capable devices populated in the system. + This policy ensures that maximum (x - y) is distributed + evenly among MSI-X capable devices. + +Note that the PCI subsystem scans y and z during a bus enumeration.
+When the PCI subsystem completes configuring the MSI/MSI-X capability +structure of a device as requested by its device driver, y/z is +decremented accordingly. + +5.3.3 Handling MSI-X shortages + +For the case where fewer MSI-X vectors are allocated to a function +than requested, the function pci_enable_msix() will return the +maximum number of MSI-X vectors available to the caller. A device +driver may re-send its request with a number of vectors no greater than +the value returned. For example, if a device driver requests 5 vectors, but +the number of available vectors is 3 vectors, a value of 3 will be +returned by the pci_enable_msix() call. A function could be +designed for its driver to use only 3 MSI-X table entries in +different combinations such as ABC--, A-B-C, A--CB, etc. Note that this +patch does not support multiple entries with the same vector. Any such +attempt by a device driver to use 5 MSI-X table entries with 3 vectors +as ABBCC, AABCC, BCCBA, etc. will cause the function +pci_enable_msix() to fail. Below are the reasons why supporting multiple +entries with the same vector is an undesirable solution. + + - The PCI subsystem cannot determine which entry + generated the message, and so which to mask/unmask, while handling + the software driver ISR. Attempting to walk through all MSI-X + table entries (2048 max) to mask/unmask any matching vector + is an undesirable solution. + + - Walking through all MSI-X table entries (2048 max) to handle + SMP affinity of any matching vector is an undesirable solution. + +5.3.4 API pci_enable_msix + +int pci_enable_msix(struct pci_dev *dev, u32 *entries, int nvec) + +This API enables a device driver to request that the PCI subsystem +enable MSI-X messages on its hardware device. Depending on +the availability of PCI vector resources, the PCI subsystem enables +either all or nothing. + +Argument dev points to the device (pci_dev) structure. + +Argument entries is a pointer of unsigned integer type.
The number of +elements is indicated in argument nvec. The content of each element +will be mapped to the following struct defined in /driver/pci/msi.h. + +struct msix_entry { + u16 vector; /* kernel uses to write alloc vector */ + u16 entry; /* driver uses to specify entry */ +}; + +A device driver is responsible for initializing the field entry of +each element with a unique entry supported by the MSI-X table. Otherwise, +-EINVAL will be returned as a result. A successful return of zero +indicates that the PCI subsystem has finished initializing each of the requested +entries of the MSI-X table with message address and message data. +Last but not least, the PCI subsystem will write the 1:1 +vector-to-entry mapping into the field vector of each element. A +device driver is responsible for keeping track of allocated MSI-X +vectors in its internal data structure. + +Argument nvec is an integer indicating the number of messages +requested. + +A return of zero indicates that the requested number of MSI-X vectors was +successfully allocated. A return greater than zero indicates +an MSI-X vector shortage, and a return less than zero indicates +a failure. This failure may be a result of duplicate entries +specified in the second argument, or a result of no available vector, +or a result of failing to initialize MSI-X table entries. + +5.3.5 API pci_disable_msix + +void pci_disable_msix(struct pci_dev *dev) + +This API should always be used to undo the effect of pci_enable_msix() +when a device driver is unloading. Note that a device driver should +always call free_irq() on all MSI-X vectors it has done request_irq() +on before calling this API. Failure to do so results in a BUG_ON(), and +the device will be left with MSI-X enabled and will leak its vectors. + +5.3.6 MSI-X mode vs. legacy mode diagram + +The diagram below shows the events that switch the interrupt +mode on the MSI-X capable device function between MSI-X mode and +PIN-IRQ assertion mode (legacy).
+ + ------------ pci_enable_msix(,,n) ------------------------ + | | <=============== | | + | MSI-X MODE | | PIN-IRQ ASSERTION MODE | + | | ===============> | | + ------------ pci_disable_msix ------------------------ + +Figure 2.0 MSI-X Mode vs. Legacy Mode + +In Figure 2.0, a device operates by default in legacy mode. A +successful MSI-X request (using pci_enable_msix()) switches a +device's interrupt mode to MSI-X mode. A pre-assigned IOAPIC vector +stored in dev->irq will be saved by the PCI subsystem; however, +unlike MSI mode, the PCI subsystem will not replace dev->irq with +the assigned MSI-X vector because the PCI subsystem already writes the 1:1 +vector-to-entry mapping into the field vector of each element +specified in the second argument. + +To return to its default mode, a device driver should always call +pci_disable_msix() to undo the effect of pci_enable_msix(). Note that +a device driver should always call free_irq() on all MSI-X vectors it +has done request_irq() on before calling pci_disable_msix(). Failure +to do so results in a BUG_ON(), and the device will be left with MSI-X +enabled and will leak its vectors. Otherwise, the PCI subsystem switches a +device function's interrupt mode from MSI-X mode to legacy mode and +marks all allocated MSI-X vectors as unused. + +Once the vectors are marked as unused, there is no guarantee that the PCI +subsystem will reserve these MSI-X vectors for a device. Depending on +the availability of current PCI vector resources and the number of +MSI/MSI-X requests from other drivers, these MSI-X vectors may be +re-assigned. + +For the case where the PCI subsystem re-assigned these MSI-X vectors +to another driver, a request to switch back to MSI-X mode may result +in being assigned another set of MSI-X vectors or a failure if no +more vectors are available.
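The return convention of pci_enable_msix() described in 5.3.3 and 5.3.4 (zero
on success, a positive count on shortage, a negative value on failure)
suggests a retry pattern on the driver side. The userspace C sketch below
models that pattern only. sim_enable_msix() is a hypothetical stand-in that
imitates the documented return values; its `available` argument, the made-up
vector numbers and the choice of -EBUSY for "no vector available" are all
assumptions so that the example can run outside the kernel. Neither function
is the kernel API, and the real prototype takes a u32 pointer rather than a
struct pointer.

```c
#include <errno.h>

/* Mirrors the struct msix_entry layout quoted in 5.3.4. */
struct msix_entry {
	unsigned short vector;	/* written by the (simulated) subsystem */
	unsigned short entry;	/* chosen by the driver */
};

/*
 * Hypothetical stand-in for pci_enable_msix(): 'available' models the
 * number of free vectors.  Returns 0 on success, the number of vectors
 * available (> 0) on shortage, or a negative errno on failure.
 */
int sim_enable_msix(int available, struct msix_entry *entries, int nvec)
{
	int i, j;

	if (nvec <= 0)
		return -EINVAL;

	/* Duplicate table entries are rejected, as described in 5.3.4. */
	for (i = 0; i < nvec; i++)
		for (j = i + 1; j < nvec; j++)
			if (entries[i].entry == entries[j].entry)
				return -EINVAL;

	if (available <= 0)
		return -EBUSY;		/* assumption: errno for "no vector" */
	if (available < nvec)
		return available;	/* shortage: nothing is enabled */

	for (i = 0; i < nvec; i++)	/* made-up vector numbers */
		entries[i].vector = (unsigned short)(0x30 + i);
	return 0;
}

/*
 * Driver-side pattern: ask for 'want' vectors and, on a shortage, retry
 * once with the count that was reported.  Returns the number of vectors
 * actually enabled, or a negative errno.
 */
int enable_msix_with_fallback(int available, struct msix_entry *entries,
			      int want)
{
	int ret = sim_enable_msix(available, entries, want);

	if (ret > 0) {			/* shortage: retry with fewer */
		want = ret;
		ret = sim_enable_msix(available, entries, want);
	}
	if (ret > 0)
		return -EBUSY;		/* still short after the retry */
	return ret == 0 ? want : ret;
}
```

In a real driver the fallback would be followed by request_irq() on each
granted vector, and by free_irq() on each of them before pci_disable_msix(),
in the order the text above requires.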
+ +5.4 Handling a function implementing both MSI and MSI-X capabilities + +For the case where a function implements both MSI and MSI-X +capabilities, the PCI subsystem enables a device to run either in MSI +mode or MSI-X mode but not both. A device driver determines whether it +wants MSI or MSI-X enabled on its hardware device. Once a device +driver requests MSI, for example, it is prohibited from requesting +MSI-X; in other words, a device driver is not permitted to ping-pong +between MSI mode and MSI-X mode at run-time. + +5.5 Hardware requirements for MSI/MSI-X support +MSI/MSI-X support requires support from both system hardware and +individual hardware device functions. + +5.5.1 System hardware support +Since the target of the MSI address is the local APIC CPU, enabling +MSI/MSI-X support in the Linux kernel is dependent on whether existing +system hardware supports a local APIC. Users should verify that their +system runs when CONFIG_X86_LOCAL_APIC=y. + +In an SMP environment, CONFIG_X86_LOCAL_APIC is automatically set; +however, in a UP environment, users must manually set +CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting +CONFIG_PCI_MSI enables the VECTOR based scheme and +the option for MSI-capable device drivers to selectively enable +MSI/MSI-X. + +Note that the CONFIG_X86_IO_APIC setting is irrelevant because MSI/MSI-X +vectors are newly allocated at runtime and MSI/MSI-X support does not +depend on BIOS support. This key independence enables MSI/MSI-X +support on future IOxAPIC-free platforms. + +5.5.2 Device hardware support +The hardware device function supports MSI by indicating the +MSI/MSI-X capability structure on its PCI capability list. By +default, this capability structure will not be initialized by +the kernel to enable MSI during the system boot. In other words, +the device function is running in its default pin assertion mode. +Note that in many cases hardware supporting MSI has bugs, +which may result in a system hang.
The software driver of specific +MSI-capable hardware is responsible for deciding whether to call +pci_enable_msi or not. A return of zero indicates the kernel +has successfully initialized the MSI/MSI-X capability structure of the +device function. The device function is now running in MSI/MSI-X mode. + +5.6 How to tell whether MSI/MSI-X is enabled on a device function + +At the driver level, a return of zero from the function call of +pci_enable_msi()/pci_enable_msix() indicates to a device driver that +its device function is initialized successfully and ready to run in +MSI/MSI-X mode. + +At the user level, users can use the command 'cat /proc/interrupts' +to display the vector allocated for a device and its interrupt +MSI/MSI-X mode ("PCI MSI"/"PCI MSIX"). The output below shows MSI mode +enabled on a SCSI Adaptec 39320D Ultra320. + + CPU0 CPU1 + 0: 324639 0 IO-APIC-edge timer + 1: 1186 0 IO-APIC-edge i8042 + 2: 0 0 XT-PIC cascade + 12: 2797 0 IO-APIC-edge i8042 + 14: 6543 0 IO-APIC-edge ide0 + 15: 1 0 IO-APIC-edge ide1 +169: 0 0 IO-APIC-level uhci-hcd +185: 0 0 IO-APIC-level uhci-hcd +193: 138 10 PCI MSI aic79xx +201: 30 0 PCI MSI aic79xx +225: 30 0 IO-APIC-level aic7xxx +233: 30 0 IO-APIC-level aic7xxx +NMI: 0 0 +LOC: 324553 325068 +ERR: 0 +MIS: 0 + +6. FAQ + +Q1. Are there any limitations on using the MSI? + +A1. If the PCI device supports MSI and conforms to the +specification and the platform supports the APIC local bus, +then using MSI should work. + +Q2. Will it work on all the Pentium processors (P3, P4, Xeon, +AMD processors)? In P3 IPIs are transmitted on the APIC local +bus and in P4 and Xeon they are transmitted on the system +bus. Are there any implications with this? + +A2. MSI support enables a PCI device to send an inbound +memory write (0xfeexxxxx as target address) on its PCI bus +directly to the FSB. Since the message address has a +redirection hint bit cleared, it should work. + +Q3.
The target address 0xfeexxxxx will be translated by the +Host Bridge into an interrupt message. Are there any +limitations on the chipsets such as Intel 8xx, Intel e7xxx, +or VIA? + +A3. If these chipsets support an inbound memory write with +the target address set to 0xfeexxxxx, in conformance with PCI +specification 2.3 or later, then it should work. + +Q4. From the driver's point of view, if the MSI is lost because +of errors that occur during the inbound memory write, then it may +wait forever. Is there a mechanism for it to recover? + +A4. Since the target of the transaction is an inbound memory +write, all transaction termination conditions (Retry, +Master-Abort, Target-Abort, or normal completion) are +supported. A device sending an MSI must abide by all the PCI +rules and conditions regarding that inbound memory write. So, +if a retry is signaled it must retry, etc. We believe that +the recommendation for Abort is also a retry (refer to PCI +specification 2.3 or later). diff --git a/Documentation/ManagementStyle b/Documentation/ManagementStyle new file mode 100644 index 000000000000..cbbebfb51ffe --- /dev/null +++ b/Documentation/ManagementStyle @@ -0,0 +1,276 @@ + + Linux kernel management style + +This is a short document describing the preferred (or made up, depending +on who you ask) management style for the linux kernel. It's meant to +mirror the CodingStyle document to some degree, and mainly written to +avoid answering (*) the same (or similar) questions over and over again. + +Management style is very personal and much harder to quantify than +simple coding style rules, so this document may or may not have anything +to do with reality. It started as a lark, but that doesn't mean that it +might not actually be true. You'll have to decide for yourself. + +Btw, when talking about "kernel manager", it's all about the technical +lead persons, not the people who do traditional management inside +companies.
If you sign purchase orders or you have any clue about the +budget of your group, you're almost certainly not a kernel manager. +These suggestions may or may not apply to you. + +First off, I'd suggest buying "Seven Habits of Highly Successful +People", and NOT read it. Burn it, it's a great symbolic gesture. + +(*) This document does so not so much by answering the question, but by +making it painfully obvious to the questioner that we don't have a clue +to what the answer is. + +Anyway, here goes: + + + Chapter 1: Decisions + +Everybody thinks managers make decisions, and that decision-making is +important. The bigger and more painful the decision, the bigger the +manager must be to make it. That's very deep and obvious, but it's not +actually true. + +The name of the game is to _avoid_ having to make a decision. In +particular, if somebody tells you "choose (a) or (b), we really need you +to decide on this", you're in trouble as a manager. The people you +manage had better know the details better than you, so if they come to +you for a technical decision, you're screwed. You're clearly not +competent to make that decision for them. + +(Corollary:if the people you manage don't know the details better than +you, you're also screwed, although for a totally different reason. +Namely that you are in the wrong job, and that _they_ should be managing +your brilliance instead). + +So the name of the game is to _avoid_ decisions, at least the big and +painful ones. Making small and non-consequential decisions is fine, and +makes you look like you know what you're doing, so what a kernel manager +needs to do is to turn the big and painful ones into small things where +nobody really cares. + +It helps to realize that the key difference between a big decision and a +small one is whether you can fix your decision afterwards. 
Any decision +can be made small by just always making sure that if you were wrong (and +you _will_ be wrong), you can always undo the damage later by +backtracking. Suddenly, you get to be doubly managerial for making +_two_ inconsequential decisions - the wrong one _and_ the right one. + +And people will even see that as true leadership (*cough* bullshit +*cough*). + +Thus the key to avoiding big decisions becomes to just avoiding to do +things that can't be undone. Don't get ushered into a corner from which +you cannot escape. A cornered rat may be dangerous - a cornered manager +is just pitiful. + +It turns out that since nobody would be stupid enough to ever really let +a kernel manager have huge fiscal responsibility _anyway_, it's usually +fairly easy to backtrack. Since you're not going to be able to waste +huge amounts of money that you might not be able to repay, the only +thing you can backtrack on is a technical decision, and there +back-tracking is very easy: just tell everybody that you were an +incompetent nincompoop, say you're sorry, and undo all the worthless +work you had people work on for the last year. Suddenly the decision +you made a year ago wasn't a big decision after all, since it could be +easily undone. + +It turns out that some people have trouble with this approach, for two +reasons: + - admitting you were an idiot is harder than it looks. We all like to + maintain appearances, and coming out in public to say that you were + wrong is sometimes very hard indeed. + - having somebody tell you that what you worked on for the last year + wasn't worthwhile after all can be hard on the poor lowly engineers + too, and while the actual _work_ was easy enough to undo by just + deleting it, you may have irrevocably lost the trust of that + engineer. And remember: "irrevocable" was what we tried to avoid in + the first place, and your decision ended up being a big one after + all. 
+ +Happily, both of these reasons can be mitigated effectively by just +admitting up-front that you don't have a friggin' clue, and telling +people ahead of the fact that your decision is purely preliminary, and +might be the wrong thing. You should always reserve the right to change +your mind, and make people very _aware_ of that. And it's much easier +to admit that you are stupid when you haven't _yet_ done the really +stupid thing. + +Then, when it really does turn out to be stupid, people just roll their +eyes and say "Oops, he did it again". + +This preemptive admission of incompetence might also make the people who +actually do the work also think twice about whether it's worth doing or +not. After all, if _they_ aren't certain whether it's a good idea, you +sure as hell shouldn't encourage them by promising them that what they +work on will be included. Make them at least think twice before they +embark on a big endeavor. + +Remember: they'd better know more about the details than you do, and +they usually already think they have the answer to everything. The best +thing you can do as a manager is not to instill confidence, but rather a +healthy dose of critical thinking on what they do. + +Btw, another way to avoid a decision is to plaintively just whine "can't +we just do both?" and look pitiful. Trust me, it works. If it's not +clear which approach is better, they'll eventually figure it out. The +answer may end up being that both teams get so frustrated by the +situation that they just give up. + +That may sound like a failure, but it's usually a sign that there was +something wrong with both projects, and the reason the people involved +couldn't decide was that they were both wrong. You end up coming up +smelling like roses, and you avoided yet another decision that you could +have screwed up on. 
+ + + Chapter 2: People + +Most people are idiots, and being a manager means you'll have to deal +with it, and perhaps more importantly, that _they_ have to deal with +_you_. + +It turns out that while it's easy to undo technical mistakes, it's not +as easy to undo personality disorders. You just have to live with +theirs - and yours. + +However, in order to prepare yourself as a kernel manager, it's best to +remember not to burn any bridges, bomb any innocent villagers, or +alienate too many kernel developers. It turns out that alienating people +is fairly easy, and un-alienating them is hard. Thus "alienating" +immediately falls under the heading of "not reversible", and becomes a +no-no according to Chapter 1. + +There's just a few simple rules here: + (1) don't call people d*ckheads (at least not in public) + (2) learn how to apologize when you forgot rule (1) + +The problem with #1 is that it's very easy to do, since you can say +"you're a d*ckhead" in millions of different ways (*), sometimes without +even realizing it, and almost always with a white-hot conviction that +you are right. + +And the more convinced you are that you are right (and let's face it, +you can call just about _anybody_ a d*ckhead, and you often _will_ be +right), the harder it ends up being to apologize afterwards. + +To solve this problem, you really only have two options: + - get really good at apologies + - spread the "love" out so evenly that nobody really ends up feeling + like they get unfairly targeted. Make it inventive enough, and they + might even be amused. + +The option of being unfailingly polite really doesn't exist. Nobody will +trust somebody who is so clearly hiding his true character. + +(*) Paul Simon sang "Fifty Ways to Lose Your Lover", because quite +frankly, "A Million Ways to Tell a Developer He Is a D*ckhead" doesn't +scan nearly as well. But I'm sure he thought about it. 
+ + + Chapter 3: People II - the Good Kind + +While it turns out that most people are idiots, the corollary to that is +sadly that you are one too, and that while we can all bask in the secure +knowledge that we're better than the average person (let's face it, +nobody ever believes that they're average or below-average), we should +also admit that we're not the sharpest knife around, and there will be +other people that are less of an idiot than you are. + +Some people react badly to smart people. Others take advantage of them. + +Make sure that you, as a kernel maintainer, are in the second group. +Suck up to them, because they are the people who will make your job +easier. In particular, they'll be able to make your decisions for you, +which is what the game is all about. + +So when you find somebody smarter than you are, just coast along. Your +management responsibilities largely become ones of saying "Sounds like a +good idea - go wild", or "That sounds good, but what about xxx?". The +second version in particular is a great way to either learn something +new about "xxx" or seem _extra_ managerial by pointing out something the +smarter person hadn't thought about. In either case, you win. + +One thing to keep in mind is that greatness in one area does +not necessarily translate to other areas. So you might prod people in +specific directions, but let's face it, they might be good at what they +do, and suck at everything else. The good news is that people tend to +naturally gravitate back to what they are good at, so it's not like you +are doing something irreversible when you _do_ prod them in some +direction. Just don't push too hard. + + + Chapter 4: Placing blame + +Things will go wrong, and people want somebody to blame. Tag, you're it. + +It's not actually that hard to accept the blame, especially if people +kind of realize that it wasn't _all_ your fault. Which brings us to the +best way of taking the blame: do it for another guy.
You'll feel good +for taking the fall, he'll feel good about not getting blamed, and the +guy who lost his whole 36GB porn-collection because of your incompetence +will grudgingly admit that you at least didn't try to weasel out of it. + +Then make the developer who really screwed up (if you can find him) know +_in_private_ that he screwed up. Not just so he can avoid it in the +future, but so that he knows he owes you one. And, perhaps even more +importantly, he's also likely the person who can fix it. Because, let's +face it, it sure ain't you. + +Taking the blame is also why you get to be manager in the first place. +It's part of what makes people trust you, and allow you the potential +glory, because you're the one who gets to say "I screwed up". And if +you've followed the previous rules, you'll be pretty good at saying that +by now. + + + Chapter 5: Things to avoid + +There's one thing people hate even more than being called "d*ckhead", +and that is being called a "d*ckhead" in a sanctimonious voice. The +first you can apologize for, the second one you won't really get the +chance. They likely will no longer be listening even if you otherwise +do a good job. + +We all think we're better than anybody else, which means that when +somebody else puts on airs, it _really_ rubs us the wrong way. You may +be morally and intellectually superior to everybody around you, but +don't try to make it too obvious unless you really _intend_ to irritate +somebody (*). + +Similarly, don't be too polite or subtle about things. Politeness easily +ends up going overboard and hiding the problem, and as they say, "On the +internet, nobody can hear you being subtle". Use a big blunt object to +hammer the point in, because you can't really depend on people getting +your point otherwise. + +Some humor can help pad both the bluntness and the moralizing. 
Going +overboard to the point of being ridiculous can drive a point home +without making it painful to the recipient, who just thinks you're being +silly. It can thus help get through the personal mental block we all +have about criticism. + +(*) Hint: internet newsgroups that are not directly related to your work +are great ways to take out your frustrations at other people. Write +insulting posts with a sneer just to get into a good flame every once in +a while, and you'll feel cleansed. Just don't crap too close to home. + + + Chapter 6: Why me? + +Since your main responsibility seems to be to take the blame for other +people's mistakes, and make it painfully obvious to everybody else that +you're incompetent, the obvious question becomes one of why do it in the +first place? + +First off, while you may or may not get screaming teenage girls (or +boys, let's not be judgmental or sexist here) knocking on your dressing +room door, you _will_ get an immense feeling of personal accomplishment +for being "in charge". Never mind the fact that you're really leading +by trying to keep up with everybody else and running after them as fast +as you can. Everybody will still think you're the person in charge. + +It's a great job if you can hack it. diff --git a/Documentation/PCIEBUS-HOWTO.txt b/Documentation/PCIEBUS-HOWTO.txt new file mode 100644 index 000000000000..c93f42a74d7e --- /dev/null +++ b/Documentation/PCIEBUS-HOWTO.txt @@ -0,0 +1,217 @@ + The PCI Express Port Bus Driver Guide HOWTO + Tom L Nguyen tom.l.nguyen@intel.com + 11/03/2004 + +1. About this guide + +This guide describes the basics of the PCI Express Port Bus driver +and provides information on how to enable the service drivers to +register/unregister with the PCI Express Port Bus Driver. + +2. Copyright 2004 Intel Corporation + +3. What is the PCI Express Port Bus Driver + +A PCI Express Port is a logical PCI-PCI Bridge structure.
There +are two types of PCI Express Port: the Root Port and the Switch +Port. The Root Port originates a PCI Express link from a PCI Express +Root Complex and the Switch Port connects PCI Express links to +internal logical PCI buses. The Switch Port, which has its secondary +bus representing the switch's internal routing logic, is called the +switch's Upstream Port. The switch's Downstream Port is bridging from +switch's internal routing bus to a bus representing the downstream +PCI Express link from the PCI Express Switch. + +A PCI Express Port can provide up to four distinct functions, +referred to in this document as services, depending on its port type. +PCI Express Port's services include native hotplug support (HP), +power management event support (PME), advanced error reporting +support (AER), and virtual channel support (VC). These services may +be handled by a single complex driver or be individually distributed +and handled by corresponding service drivers. + +4. Why use the PCI Express Port Bus Driver? + +In existing Linux kernels, the Linux Device Driver Model allows a +physical device to be handled by only a single driver. The PCI +Express Port is a PCI-PCI Bridge device with multiple distinct +services. To maintain a clean and simple solution each service +may have its own software service driver. In this case several +service drivers will compete for a single PCI-PCI Bridge device. +For example, if the PCI Express Root Port native hotplug service +driver is loaded first, it claims a PCI-PCI Bridge Root Port. The +kernel therefore does not load other service drivers for that Root +Port. In other words, it is impossible to have multiple service +drivers load and run on a PCI-PCI Bridge device simultaneously +using the current driver model. 
+ +Enabling multiple service drivers to run simultaneously requires +a PCI Express Port Bus driver, which manages all populated +PCI Express Ports and distributes all provided service requests +to the corresponding service drivers as required. Some key +advantages of using the PCI Express Port Bus driver are listed below: + + - Allow multiple service drivers to run simultaneously on + a PCI-PCI Bridge Port device. + + - Allow service drivers to be implemented in an independent + staged approach. + + - Allow one service driver to run on multiple PCI-PCI Bridge + Port devices. + + - Manage and distribute resources of a PCI-PCI Bridge Port + device to requested service drivers. + +5. Configuring the PCI Express Port Bus Driver vs. Service Drivers + +5.1 Including the PCI Express Port Bus Driver Support into the Kernel + +Including the PCI Express Port Bus driver depends on whether the PCI +Express support is included in the kernel config. The kernel will +automatically include the PCI Express Port Bus driver as a kernel +driver when the PCI Express support is enabled in the kernel. + +5.2 Enabling Service Driver Support + +PCI device drivers are implemented based on the Linux Device Driver Model. +All service drivers are PCI device drivers. As discussed above, it is +impossible to load any service driver once the kernel has loaded the +PCI Express Port Bus Driver. Meeting the PCI Express Port Bus Driver +Model requires some minimal changes to existing service drivers that +impose no impact on the functionality of existing service drivers. + +A service driver is required to use the two APIs shown below to +register its service with the PCI Express Port Bus driver (see +sections 5.2.1 & 5.2.2). It is important that a service driver +initializes the pcie_port_service_driver data structure, included in +header file /include/linux/pcieport_if.h, before calling these APIs.
+Failure to do so will result in an identity mismatch, which prevents +the PCI Express Port Bus driver from loading a service driver. + +5.2.1 pcie_port_service_register + +int pcie_port_service_register(struct pcie_port_service_driver *new) + +This API replaces the Linux Driver Model's pci_module_init API. A +service driver should always call pcie_port_service_register at +module init. Note that after the service driver is loaded, calls +such as pci_enable_device(dev) and pci_set_master(dev) are no longer +necessary since these calls are executed by the PCI Port Bus driver. + +5.2.2 pcie_port_service_unregister + +void pcie_port_service_unregister(struct pcie_port_service_driver *new) + +pcie_port_service_unregister replaces the Linux Driver Model's +pci_unregister_driver. It's always called by the service driver when a +module exits. + +5.2.3 Sample Code + +Below is sample service driver code to initialize the port service +driver data structure. + +static struct pcie_port_service_id service_id[] = { { + .vendor = PCI_ANY_ID, + .device = PCI_ANY_ID, + .port_type = PCIE_RC_PORT, + .service_type = PCIE_PORT_SERVICE_AER, + }, { /* end: all zeroes */ } +}; + +static struct pcie_port_service_driver root_aerdrv = { + .name = (char *)device_name, + .id_table = &service_id[0], + + .probe = aerdrv_load, + .remove = aerdrv_unload, + + .suspend = aerdrv_suspend, + .resume = aerdrv_resume, +}; + +Below is sample code for registering/unregistering a service +driver. + +static int __init aerdrv_service_init(void) +{ + int retval = 0; + + retval = pcie_port_service_register(&root_aerdrv); + if (!retval) { + /* + * FIX ME + */ + } + return retval; +} + +static void __exit aerdrv_service_exit(void) +{ + pcie_port_service_unregister(&root_aerdrv); +} + +module_init(aerdrv_service_init); +module_exit(aerdrv_service_exit); + +6.
Possible Resource Conflicts + +Since all service drivers of a PCI-PCI Bridge Port device are +allowed to run simultaneously, a few possible resource +conflicts are listed below with proposed solutions. + +6.1 MSI Vector Resource + +The MSI capability structure enables a device software driver to call +pci_enable_msi to request MSI based interrupts. Once MSI interrupts +are enabled on a device, it stays in this mode until a device driver +calls pci_disable_msi to disable MSI interrupts and revert back to +INTx emulation mode. Since service drivers of the same PCI-PCI Bridge +port share the same physical device, if an individual service driver +calls pci_enable_msi/pci_disable_msi it may result in unpredictable +behavior. For example, two service drivers run simultaneously on the +same physical Root Port. Both service drivers call pci_enable_msi to +request MSI based interrupts. A service driver may not know whether +any other service drivers have run on this Root Port. If either one +of them calls pci_disable_msi, it puts the other service driver +in a wrong interrupt mode. + +To avoid this situation, service drivers are not permitted to +switch the interrupt mode on their device. The PCI Express Port Bus driver +is responsible for determining the interrupt mode and this should be +transparent to service drivers. Service drivers need to know only +the vector IRQ assigned to the field irq of struct pcie_device, which +is passed in when the PCI Express Port Bus driver probes each service +driver. Service drivers should use (struct pcie_device*)dev->irq to +call request_irq/free_irq. In addition, the interrupt mode is stored +in the field interrupt_mode of struct pcie_device. + +6.2 MSI-X Vector Resources + +Similar to the MSI a device driver for an MSI-X capable device can +call pci_enable_msix to request MSI-X interrupts. Service drivers +are not permitted to switch the interrupt mode on their device.
The PCI +Express Port Bus driver is responsible for determining the interrupt +mode and this should be transparent to service drivers. Any attempt +by a service driver to call pci_enable_msix/pci_disable_msix may +result in unpredictable behavior. Service drivers should use +(struct pcie_device*)dev->irq and call request_irq/free_irq. + +6.3 PCI Memory/IO Mapped Regions + +Service drivers for PCI Express Power Management (PME), Advanced +Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access +PCI configuration space on the PCI Express port. In all cases the +registers accessed are independent of each other. This patch assumes +that all service drivers will be well behaved and not overwrite +other service drivers' configuration settings. + +6.4 PCI Config Registers + +Each service driver runs its PCI config operations on its own +capability structure except the PCI Express capability structure, in +which Root Control register and Device Control register are shared +between PME and AER. This patch assumes that all service drivers +will be well behaved and not overwrite other service drivers' +configuration settings. diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt new file mode 100644 index 000000000000..12250b342e1f --- /dev/null +++ b/Documentation/RCU/RTFP.txt @@ -0,0 +1,387 @@ +Read the F-ing Papers! + + +This document describes RCU-related publications, and is followed by +the corresponding bibtex entries. + +The first thing resembling RCU was published in 1980, when Kung and Lehman +[Kung80] recommended use of a garbage collector to defer destruction +of nodes in a parallel binary search tree in order to simplify its +implementation. This works well in environments that have garbage +collectors, but current production garbage collectors incur significant +read-side overhead.
+ +In 1982, Manber and Ladner [Manber82,Manber84] recommended deferring +destruction until all threads running at that time have terminated, again +for a parallel binary search tree. This approach works well in systems +with short-lived threads, such as the K42 research operating system. +However, Linux has long-lived tasks, so more is needed. + +In 1986, Hennessy, Osisek, and Seigh [Hennessy89] introduced passive +serialization, which is an RCU-like mechanism that relies on the presence +of "quiescent states" in the VM/XA hypervisor that are guaranteed not +to be referencing the data structure. However, this mechanism was not +optimized for modern computer systems, which is not surprising given +that these overheads were not so expensive in the mid-80s. Nonetheless, +passive serialization appears to be the first deferred-destruction +mechanism to be used in production. Furthermore, the relevant patent has +lapsed, so this approach may be used in non-GPL software, if desired. +(In contrast, use of RCU is permitted only in software licensed under +GPL. Sorry!!!) + +In 1990, Pugh [Pugh90] noted that explicitly tracking which threads +were reading a given data structure permitted deferred free to operate +in the presence of non-terminating threads. However, this explicit +tracking imposes significant read-side overhead, which is undesirable +in read-mostly situations. This algorithm does take pains to avoid +write-side contention and parallelize the other write-side overheads by +providing a fine-grained locking design, however, it would be interesting +to see how much of the performance advantage reported in 1990 remains +in 2004. + +At about this same time, Adams [Adams91] described ``chaotic relaxation'', +where the normal barriers between successive iterations of convergent +numerical algorithms are relaxed, so that iteration $n$ might use +data from iteration $n-1$ or even $n-2$. 
This introduces error, +which typically slows convergence and thus increases the number of +iterations required. However, this increase is sometimes more than made +up for by a reduction in the number of expensive barrier operations, +which are otherwise required to synchronize the threads at the end +of each iteration. Unfortunately, chaotic relaxation requires highly +structured data, such as the matrices used in scientific programs, and +is thus inapplicable to most data structures in operating-system kernels. + +In 1993, Jacobson [Jacobson93] verbally described what is perhaps the +simplest deferred-free technique: simply waiting a fixed amount of time +before freeing blocks awaiting deferred free. Jacobson did not describe +any write-side changes he might have made in this work using SGI's Irix +kernel. Aju John published a similar technique in 1995 [AjuJohn95]. +This works well if there is a well-defined upper bound on the length of +time that reading threads can hold references, as there might well be in +hard real-time systems. However, if this time is exceeded, perhaps due +to preemption, excessive interrupts, or larger-than-anticipated load, +memory corruption can ensue, with no reasonable means of diagnosis. +Jacobson's technique is therefore inappropriate for use in production +operating-system kernels, except when such kernels can provide hard +real-time response guarantees for all operations. + +Also in 1995, Pu et al. [Pu95a] applied a technique similar to that of Pugh's +read-side-tracking to permit replugging of algorithms within a commercial +Unix operating system. However, this replugging permitted only a single +reader at a time. The following year, this same group of researchers +extended their technique to allow for multiple readers [Cowan96a]. +Their approach requires memory barriers (and thus pipeline stalls), +but reduces memory latency, contention, and locking overheads. 
+ +1995 also saw the first publication of DYNIX/ptx's RCU mechanism +[Slingwine95], which was optimized for modern CPU architectures, +and was successfully applied to a number of situations within the +DYNIX/ptx kernel. The corresponding conference paper appeared in 1998 +[McKenney98]. + +In 1999, the Tornado and K42 groups described their "generations" +mechanism, which is quite similar to RCU [Gamsa99]. These operating systems +made pervasive use of RCU in place of "existence locks", which greatly +simplifies locking hierarchies. + +2001 saw the first RCU presentation involving Linux [McKenney01a] +at OLS. The resulting abundance of RCU patches was presented the +following year [McKenney02a], and use of RCU in dcache was first +described that same year [Linder02a]. + +Also in 2002, Michael [Michael02b,Michael02a] presented techniques +that defer the destruction of data structures to simplify non-blocking +synchronization (wait-free synchronization, lock-free synchronization, +and obstruction-free synchronization are all examples of non-blocking +synchronization). In particular, this technique eliminates locking, +reduces contention, reduces memory latency for readers, and parallelizes +pipeline stalls and memory latency for writers. However, these +techniques still impose significant read-side overhead in the form of +memory barriers. Researchers at Sun worked along similar lines in the +same timeframe [HerlihyLM02,HerlihyLMS03]. + +In 2003, the K42 group described how RCU could be used to create +hot-pluggable implementations of operating-system functions [Appavoo03a]. Later that +year saw a paper describing an RCU implementation of System V IPC +[Arcangeli03], and an introduction to RCU in Linux Journal [McKenney03a].
+ +2004 has seen a Linux-Journal article on use of RCU in dcache +[McKenney04a], a performance comparison of locking to RCU on several +different CPUs [McKenney04b], a dissertation describing use of RCU in a +number of operating-system kernels [PaulEdwardMcKenneyPhD], and a paper +describing how to make RCU safe for soft-realtime applications [Sarma04c]. + + +Bibtex Entries + +@article{Kung80 +,author="H. T. Kung and Q. Lehman" +,title="Concurrent Maintenance of Binary Search Trees" +,Year="1980" +,Month="September" +,journal="ACM Transactions on Database Systems" +,volume="5" +,number="3" +,pages="354-382" +} + +@techreport{Manber82 +,author="Udi Manber and Richard E. Ladner" +,title="Concurrency Control in a Dynamic Search Structure" +,institution="Department of Computer Science, University of Washington" +,address="Seattle, Washington" +,year="1982" +,number="82-01-01" +,month="January" +,pages="28" +} + +@article{Manber84 +,author="Udi Manber and Richard E. Ladner" +,title="Concurrency Control in a Dynamic Search Structure" +,Year="1984" +,Month="September" +,journal="ACM Transactions on Database Systems" +,volume="9" +,number="3" +,pages="439-455" +} + +@techreport{Hennessy89 +,author="James P. Hennessy and Damian L. Osisek and Joseph W. {Seigh II}" +,title="Passive Serialization in a Multitasking Environment" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="1989" +,number="US Patent 4,809,168 (lapsed)" +,month="February" +,pages="11" +} + +@techreport{Pugh90 +,author="William Pugh" +,title="Concurrent Maintenance of Skip Lists" +,institution="Institute of Advanced Computer Science Studies, Department of Computer Science, University of Maryland" +,address="College Park, Maryland" +,year="1990" +,number="CS-TR-2222.1" +,month="June" +} + +@Book{Adams91 +,Author="Gregory R. 
Adams" +,title="Concurrent Programming, Principles, and Practices" +,Publisher="Benjamin Cummins" +,Year="1991" +} + +@unpublished{Jacobson93 +,author="Van Jacobson" +,title="Avoid Read-Side Locking Via Delayed Free" +,year="1993" +,month="September" +,note="Verbal discussion" +} + +@Conference{AjuJohn95 +,Author="Aju John" +,Title="Dynamic vnodes -- Design and Implementation" +,Booktitle="{USENIX Winter 1995}" +,Publisher="USENIX Association" +,Month="January" +,Year="1995" +,pages="11-23" +,Address="New Orleans, LA" +} + +@techreport{Slingwine95 +,author="John D. Slingwine and Paul E. McKenney" +,title="Apparatus and Method for Achieving Reduced Overhead Mutual +Exclusion and Maintaining Coherency in a Multiprocessor System +Utilizing Execution History and Thread Monitoring" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="1995" +,number="US Patent 5,442,758 (contributed under GPL)" +,month="August" +} + +@techreport{Slingwine97 +,author="John D. Slingwine and Paul E. McKenney" +,title="Method for maintaining data coherency using thread +activity summaries in a multicomputer system" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="1997" +,number="US Patent 5,608,893 (contributed under GPL)" +,month="March" +} + +@techreport{Slingwine98 +,author="John D. Slingwine and Paul E. McKenney" +,title="Apparatus and method for achieving reduced overhead +mutual exclusion and maintaining coherency in a multiprocessor +system utilizing execution history and thread monitoring" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="1998" +,number="US Patent 5,727,209 (contributed under GPL)" +,month="March" +} + +@Conference{McKenney98 +,Author="Paul E. McKenney and John D. 
Slingwine" +,Title="Read-Copy Update: Using Execution History to Solve Concurrency +Problems" +,Booktitle="{Parallel and Distributed Computing and Systems}" +,Month="October" +,Year="1998" +,pages="509-518" +,Address="Las Vegas, NV" +} + +@Conference{Gamsa99 +,Author="Ben Gamsa and Orran Krieger and Jonathan Appavoo and Michael Stumm" +,Title="Tornado: Maximizing Locality and Concurrency in a Shared Memory +Multiprocessor Operating System" +,Booktitle="{Proceedings of the 3\textsuperscript{rd} Symposium on +Operating System Design and Implementation}" +,Month="February" +,Year="1999" +,pages="87-100" +,Address="New Orleans, LA" +} + +@techreport{Slingwine01 +,author="John D. Slingwine and Paul E. McKenney" +,title="Apparatus and method for achieving reduced overhead +mutual exclusion and maintaining coherency in a multiprocessor +system utilizing execution history and thread monitoring" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="2001" +,number="US Patent 5,219,690 (contributed under GPL)" +,month="April" +} + +@Conference{McKenney01a +,Author="Paul E. McKenney and Jonathan Appavoo and Andi Kleen and +Orran Krieger and Rusty Russell and Dipankar Sarma and Maneesh Soni" +,Title="Read-Copy Update" +,Booktitle="{Ottawa Linux Symposium}" +,Month="July" +,Year="2001" +,note="Available: +\url{http://www.linuxsymposium.org/2001/abstracts/readcopy.php} +\url{http://www.rdrop.com/users/paulmck/rclock/rclock_OLS.2001.05.01c.pdf} +[Viewed June 23, 2004]" +annotation=" +Described RCU, and presented some patches implementing and using it in +the Linux kernel. +" +} + +@Conference{Linder02a +,Author="Hanna Linder and Dipankar Sarma and Maneesh Soni" +,Title="Scalability of the Directory Entry Cache" +,Booktitle="{Ottawa Linux Symposium}" +,Month="June" +,Year="2002" +,pages="289-300" +} + +@Conference{McKenney02a +,Author="Paul E. 
McKenney and Dipankar Sarma and +Andrea Arcangeli and Andi Kleen and Orran Krieger and Rusty Russell" +,Title="Read-Copy Update" +,Booktitle="{Ottawa Linux Symposium}" +,Month="June" +,Year="2002" +,pages="338-367" +,note="Available: +\url{http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz} +[Viewed June 23, 2004]" +} + +@article{Appavoo03a +,author="J. Appavoo and K. Hui and C. A. N. Soules and R. W. Wisniewski and +D. M. {Da Silva} and O. Krieger and M. A. Auslander and D. J. Edelsohn and +B. Gamsa and G. R. Ganger and P. McKenney and M. Ostrowski and +B. Rosenburg and M. Stumm and J. Xenidis" +,title="Enabling Autonomic Behavior in Systems Software With Hot Swapping" +,Year="2003" +,Month="January" +,journal="IBM Systems Journal" +,volume="42" +,number="1" +,pages="60-76" +} + +@Conference{Arcangeli03 +,Author="Andrea Arcangeli and Mingming Cao and Paul E. McKenney and +Dipankar Sarma" +,Title="Using Read-Copy Update Techniques for {System V IPC} in the +{Linux} 2.5 Kernel" +,Booktitle="Proceedings of the 2003 USENIX Annual Technical Conference +(FREENIX Track)" +,Publisher="USENIX Association" +,year="2003" +,month="June" +,pages="297-310" +} + +@article{McKenney03a +,author="Paul E. McKenney" +,title="Using {RCU} in the {Linux} 2.5 Kernel" +,Year="2003" +,Month="October" +,journal="Linux Journal" +,volume="1" +,number="114" +,pages="18-26" +} + +@article{McKenney04a +,author="Paul E. McKenney and Dipankar Sarma and Maneesh Soni" +,title="Scaling dcache with {RCU}" +,Year="2004" +,Month="January" +,journal="Linux Journal" +,volume="1" +,number="118" +,pages="38-46" +} + +@Conference{McKenney04b +,Author="Paul E. McKenney" +,Title="{RCU} vs. 
Locking Performance on Different {CPUs}" +,Booktitle="{linux.conf.au}" +,Month="January" +,Year="2004" +,Address="Adelaide, Australia" +,note="Available: +\url{http://www.linux.org.au/conf/2004/abstracts.html#90} +\url{http://www.rdrop.com/users/paulmck/rclock/lockperf.2004.01.17a.pdf} +[Viewed June 23, 2004]" +} + +@phdthesis{PaulEdwardMcKenneyPhD +,author="Paul E. McKenney" +,title="Exploiting Deferred Destruction: +An Analysis of Read-Copy-Update Techniques +in Operating System Kernels" +,school="OGI School of Science and Engineering at +Oregon Health and Sciences University" +,year="2004" +} + +@Conference{Sarma04c +,Author="Dipankar Sarma and Paul E. McKenney" +,Title="Making RCU Safe for Deep Sub-Millisecond Response Realtime Applications" +,Booktitle="Proceedings of the 2004 USENIX Annual Technical Conference +(FREENIX Track)" +,Publisher="USENIX Association" +,year="2004" +,month="June" +,pages="182-191" +} diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt new file mode 100644 index 000000000000..551a803d82a8 --- /dev/null +++ b/Documentation/RCU/UP.txt @@ -0,0 +1,64 @@ +RCU on Uniprocessor Systems + + +A common misconception is that, on UP systems, the call_rcu() primitive +may immediately invoke its function, and that the synchronize_kernel +primitive may return immediately. The basis of this misconception +is that since there is only one CPU, it should not be necessary to +wait for anything else to get done, since there are no other CPUs for +anything else to be happening on. Although this approach will sort of +work a surprising amount of the time, it is a very bad idea in general. +This document presents two examples that demonstrate exactly how bad an +idea this is. + + +Example 1: softirq Suicide + +Suppose that an RCU-based algorithm scans a linked list containing +elements A, B, and C in process context, and can delete elements from +this same list in softirq context. 
Suppose that the process-context scan +is referencing element B when it is interrupted by softirq processing, +which deletes element B, and then invokes call_rcu() to free element B +after a grace period. + +Now, if call_rcu() were to directly invoke its arguments, then upon return +from softirq, the list scan would find itself referencing a newly freed +element B. This situation can greatly decrease the life expectancy of +your kernel. + + +Example 2: Function-Call Fatality + +Of course, one could avert the suicide described in the preceding example +by having call_rcu() directly invoke its arguments only if it was called +from process context. However, this can fail in a similar manner. + +Suppose that an RCU-based algorithm again scans a linked list containing +elements A, B, and C in process contexts, but that it invokes a function +on each element as it is scanned. Suppose further that this function +deletes element B from the list, then passes it to call_rcu() for deferred +freeing. This may be a bit unconventional, but it is perfectly legal +RCU usage, since call_rcu() must wait for a grace period to elapse. +Therefore, in this case, allowing call_rcu() to immediately invoke +its arguments would cause it to fail to make the fundamental guarantee +underlying RCU, namely that call_rcu() defers invoking its arguments until +all RCU read-side critical sections currently executing have completed. + +Quick Quiz: why is it -not- legal to invoke synchronize_kernel() in +this case? + + +Summary + +Permitting call_rcu() to immediately invoke its arguments or permitting +synchronize_kernel() to immediately return breaks RCU, even on a UP system. +So do not do it! Even on a UP system, the RCU infrastructure -must- +respect grace periods. + + +Answer to Quick Quiz + +The calling function is scanning an RCU-protected linked list, and +is therefore within an RCU read-side critical section. 
Therefore, +the called function has been invoked within an RCU read-side critical +section, and is not permitted to block. diff --git a/Documentation/RCU/arrayRCU.txt b/Documentation/RCU/arrayRCU.txt new file mode 100644 index 000000000000..453ebe6953ee --- /dev/null +++ b/Documentation/RCU/arrayRCU.txt @@ -0,0 +1,141 @@ +Using RCU to Protect Read-Mostly Arrays + + +Although RCU is more commonly used to protect linked lists, it can +also be used to protect arrays. Three situations are as follows: + +1. Hash Tables + +2. Static Arrays + +3. Resizeable Arrays + +Each of these situations is discussed below. + + +Situation 1: Hash Tables + +Hash tables are often implemented as an array, where each array entry +has a linked-list hash chain. Each hash chain can be protected by RCU +as described in the listRCU.txt document. This approach also applies +to other array-of-list situations, such as radix trees. + + +Situation 2: Static Arrays + +Static arrays, where the data (rather than a pointer to the data) is +located in each array element, and where the array is never resized, +have not been used with RCU. Rik van Riel recommends using seqlock in +this situation, which would also have minimal read-side overhead as long +as updates are rare. + +Quick Quiz: Why is it so important that updates be rare when + using seqlock? + + +Situation 3: Resizeable Arrays + +Use of RCU for resizeable arrays is demonstrated by the grow_ary() +function used by the System V IPC code. The array is used to map from +semaphore, message-queue, and shared-memory IDs to the data structure +that represents the corresponding IPC construct. The grow_ary() +function does not acquire any locks; instead its caller must hold the +ids->sem semaphore.
+ +The grow_ary() function, shown below, does some limit checks, allocates a +new ipc_id_ary, copies the old array into the first part of the new one, +initializes the remainder of the new array, updates the ids->entries pointer +to point to the new array, and invokes ipc_rcu_putref() to free up the old +array. Note that the ids->entries pointer is updated using +rcu_assign_pointer(), which includes any memory barriers required on +whatever architecture you are running on. + + static int grow_ary(struct ipc_ids* ids, int newsize) + { + struct ipc_id_ary* new; + struct ipc_id_ary* old; + int i; + int size = ids->entries->size; + + if(newsize > IPCMNI) + newsize = IPCMNI; + if(newsize <= size) + return newsize; + + new = ipc_rcu_alloc(sizeof(struct kern_ipc_perm *)*newsize + + sizeof(struct ipc_id_ary)); + if(new == NULL) + return size; + new->size = newsize; + memcpy(new->p, ids->entries->p, + sizeof(struct kern_ipc_perm *)*size + + sizeof(struct ipc_id_ary)); + for(i=size;i<newsize;i++) { + new->p[i] = NULL; + } + old = ids->entries; + + /* + * Use rcu_assign_pointer() to make sure the memcpyed + * contents of the new array are visible before the new + * array becomes visible. + */ + rcu_assign_pointer(ids->entries, new); + + ipc_rcu_putref(old); + return newsize; + } + +The ipc_rcu_putref() function decrements the array's reference count +and then, if the reference count has dropped to zero, uses call_rcu() +to free the array after a grace period has elapsed. + +The array is traversed by the ipc_lock() function. This function +indexes into the array under the protection of rcu_read_lock(), +using rcu_dereference() to pick up the pointer to the array so +that it may later safely be dereferenced -- memory barriers are +required on the Alpha CPU. Since the size of the array is stored +with the array itself, there can be no array-size mismatches, so +a simple check suffices. The pointer to the structure corresponding +to the desired IPC object is placed in "out", with NULL indicating +a non-existent entry.
After acquiring "out->lock", the "out->deleted" +flag indicates whether the IPC object is in the process of being +deleted, and, if not, the pointer is returned. + + struct kern_ipc_perm* ipc_lock(struct ipc_ids* ids, int id) + { + struct kern_ipc_perm* out; + int lid = id % SEQ_MULTIPLIER; + struct ipc_id_ary* entries; + + rcu_read_lock(); + entries = rcu_dereference(ids->entries); + if(lid >= entries->size) { + rcu_read_unlock(); + return NULL; + } + out = entries->p[lid]; + if(out == NULL) { + rcu_read_unlock(); + return NULL; + } + spin_lock(&out->lock); + + /* ipc_rmid() may have already freed the ID while ipc_lock + * was spinning: here verify that the structure is still valid + */ + if (out->deleted) { + spin_unlock(&out->lock); + rcu_read_unlock(); + return NULL; + } + return out; + } + + +Answer to Quick Quiz: + + The reason that it is important that updates be rare when + using seqlock is that frequent updates can livelock readers. + One way to avoid this problem is to assign a seqlock for + each array entry rather than to the entire array. diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt new file mode 100644 index 000000000000..b3a568abe6b1 --- /dev/null +++ b/Documentation/RCU/checklist.txt @@ -0,0 +1,157 @@ +Review Checklist for RCU Patches + + +This document contains a checklist for producing and reviewing patches +that make use of RCU. Violating any of the rules listed below will +result in the same sorts of problems that leaving out a locking primitive +would cause. This list is based on experiences reviewing such patches +over a rather long period of time, but improvements are always welcome! + +0. Is RCU being applied to a read-mostly situation? If the data + structure is updated more than about 10% of the time, then + you should strongly consider some other approach, unless + detailed performance measurements show that RCU is nonetheless + the right tool for the job. 
+ + The other exception would be where performance is not an issue, + and RCU provides a simpler implementation. An example of this + situation is the dynamic NMI code in the Linux 2.6 kernel, + at least on architectures where NMIs are rare. + +1. Does the update code have proper mutual exclusion? + + RCU does allow -readers- to run (almost) naked, but -writers- must + still use some sort of mutual exclusion, such as: + + a. locking, + b. atomic operations, or + c. restricting updates to a single task. + + If you choose #b, be prepared to describe how you have handled + memory barriers on weakly ordered machines (pretty much all of + them -- even x86 allows reads to be reordered), and be prepared + to explain why this added complexity is worthwhile. If you + choose #c, be prepared to explain how this single task does not + become a major bottleneck on big multiprocessor machines. + +2. Do the RCU read-side critical sections make proper use of + rcu_read_lock() and friends? These primitives are needed + to suppress preemption (or bottom halves, in the case of + rcu_read_lock_bh()) in the read-side critical sections, + and are also an excellent aid to readability. + +3. Does the update code tolerate concurrent accesses? + + The whole point of RCU is to permit readers to run without + any locks or atomic operations. This means that readers will + be running while updates are in progress. There are a number + of ways to handle this concurrency, depending on the situation: + + a. Make updates appear atomic to readers. For example, + pointer updates to properly aligned fields will appear + atomic, as will individual atomic primitives. Operations + performed under a lock and sequences of multiple atomic + primitives will -not- appear to be atomic. + + This is almost always the best approach. + + b. Carefully order the updates and the reads so that + readers see valid data at all phases of the update. 
+ This is often more difficult than it sounds, especially + given modern CPUs' tendency to reorder memory references. + One must usually liberally sprinkle memory barriers + (smp_wmb(), smp_rmb(), smp_mb()) through the code, + making it difficult to understand and to test. + + It is usually better to group the changing data into + a separate structure, so that the change may be made + to appear atomic by updating a pointer to reference + a new structure containing updated values. + +4. Weakly ordered CPUs pose special challenges. Almost all CPUs + are weakly ordered -- even i386 CPUs allow reads to be reordered. + RCU code must take all of the following measures to prevent + memory-corruption problems: + + a. Readers must maintain proper ordering of their memory + accesses. The rcu_dereference() primitive ensures that + the CPU picks up the pointer before it picks up the data + that the pointer points to. This really is necessary + on Alpha CPUs. If you don't believe me, see: + + http://www.openvms.compaq.com/wizard/wiz_2637.html + + The rcu_dereference() primitive is also an excellent + documentation aid, letting the person reading the code + know exactly which pointers are protected by RCU. + + The rcu_dereference() primitive is used by the various + "_rcu()" list-traversal primitives, such as + list_for_each_entry_rcu(). + + b. If the list macros are being used, the list_add_rcu(), + list_add_tail_rcu(), and list_del_rcu() primitives must + be used in order to prevent weakly ordered machines from + misordering structure initialization and pointer planting. + Similarly, if the hlist macros are being used, the + hlist_del_rcu() and hlist_add_head_rcu() primitives + are required. + + c. Updates must ensure that initialization of a given + structure happens before pointers to that structure are + publicized. Use the rcu_assign_pointer() primitive + when publicizing a pointer to a structure that can + be traversed by an RCU read-side critical section.
+ + [The rcu_assign_pointer() primitive is in process.] + +5. If call_rcu(), or a related primitive such as call_rcu_bh(), + is used, the callback function must be written to be called + from softirq context. In particular, it cannot block. + +6. Since synchronize_kernel() blocks, it cannot be called from + any sort of irq context. + +7. If the updater uses call_rcu(), then the corresponding readers + must use rcu_read_lock() and rcu_read_unlock(). If the updater + uses call_rcu_bh(), then the corresponding readers must use + rcu_read_lock_bh() and rcu_read_unlock_bh(). Mixing things up + will result in confusion and broken kernels. + + One exception to this rule: rcu_read_lock() and rcu_read_unlock() + may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh() + in cases where local bottom halves are already known to be + disabled, for example, in irq or softirq context. Commenting + such cases is a must, of course! And the jury is still out on + whether the increased speed is worth it. + +8. Although synchronize_kernel() is a bit slower than is call_rcu(), + it usually results in simpler code. So, unless update performance + is important or the updaters cannot block, synchronize_kernel() + should be used in preference to call_rcu(). + +9. All RCU list-traversal primitives, which include + list_for_each_rcu(), list_for_each_entry_rcu(), + list_for_each_continue_rcu(), and list_for_each_safe_rcu(), + must be within an RCU read-side critical section. RCU + read-side critical sections are delimited by rcu_read_lock() + and rcu_read_unlock(), or by similar primitives such as + rcu_read_lock_bh() and rcu_read_unlock_bh(). + + Use of the _rcu() list-traversal primitives outside of an + RCU read-side critical section causes no harm other than + a slight performance degradation on Alpha CPUs and some + confusion on the part of people trying to read the code. 
+ + Another way of thinking of this is "If you are holding the + lock that prevents the data structure from changing, why do + you also need RCU-based protection?" That said, there may + well be situations where use of the _rcu() list-traversal + primitives while the update-side lock is held results in + simpler and more maintainable code. The jury is still out + on this question. + +10. Conversely, if you are in an RCU read-side critical section, + you -must- use the "_rcu()" variants of the list macros. + Failing to do so will break Alpha and confuse people reading + your code. diff --git a/Documentation/RCU/listRCU.txt b/Documentation/RCU/listRCU.txt new file mode 100644 index 000000000000..bda6ead69bd0 --- /dev/null +++ b/Documentation/RCU/listRCU.txt @@ -0,0 +1,307 @@ +Using RCU to Protect Read-Mostly Linked Lists + + +One of the best applications of RCU is to protect read-mostly linked lists +("struct list_head" in list.h). One big advantage of this approach +is that all of the required memory barriers are included for you in +the list macros. This document describes several applications of RCU, +with the best fits first. + + +Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates + +The best applications are cases where, if reader-writer locking were +used, the read-side lock would be dropped before taking any action +based on the results of the search. The most celebrated example is +the routing table. Because the routing table is tracking the state of +equipment outside of the computer, it will at times contain stale data. +Therefore, once the route has been computed, there is no need to hold +the routing table static during transmission of the packet. After all, +you can hold the routing table static all you want, but that won't keep +the external Internet from changing, and it is the state of the external +Internet that really matters. In addition, routing entries are typically +added or deleted, rather than being modified in place. 
+ +A straightforward example of this use of RCU may be found in the +system-call auditing support. For example, a reader-writer locked +implementation of audit_filter_task() might be as follows: + + static enum audit_state audit_filter_task(struct task_struct *tsk) + { + struct audit_entry *e; + enum audit_state state; + + read_lock(&auditsc_lock); + list_for_each_entry(e, &audit_tsklist, list) { + if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { + read_unlock(&auditsc_lock); + return state; + } + } + read_unlock(&auditsc_lock); + return AUDIT_BUILD_CONTEXT; + } + +Here the list is searched under the lock, but the lock is dropped before +the corresponding value is returned. By the time that this value is acted +on, the list may well have been modified. This makes sense, since if +you are turning auditing off, it is OK to audit a few extra system calls. + +This means that RCU can be easily applied to the read side, as follows: + + static enum audit_state audit_filter_task(struct task_struct *tsk) + { + struct audit_entry *e; + enum audit_state state; + + rcu_read_lock(); + list_for_each_entry_rcu(e, &audit_tsklist, list) { + if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { + rcu_read_unlock(); + return state; + } + } + rcu_read_unlock(); + return AUDIT_BUILD_CONTEXT; + } + +The read_lock() and read_unlock() calls have become rcu_read_lock() +and rcu_read_unlock(), respectively, and the list_for_each_entry() has +become list_for_each_entry_rcu(). The _rcu() list-traversal primitives +insert the read-side memory barriers that are required on DEC Alpha CPUs. + +The changes to the update side are also straightforward. 
A reader-writer +lock might be used as follows for deletion and insertion: + + static inline int audit_del_rule(struct audit_rule *rule, + struct list_head *list) + { + struct audit_entry *e; + + write_lock(&auditsc_lock); + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + list_del(&e->list); + write_unlock(&auditsc_lock); + return 0; + } + } + write_unlock(&auditsc_lock); + return -EFAULT; /* No matching rule */ + } + + static inline int audit_add_rule(struct audit_entry *entry, + struct list_head *list) + { + write_lock(&auditsc_lock); + if (entry->rule.flags & AUDIT_PREPEND) { + entry->rule.flags &= ~AUDIT_PREPEND; + list_add(&entry->list, list); + } else { + list_add_tail(&entry->list, list); + } + write_unlock(&auditsc_lock); + return 0; + } + +Following are the RCU equivalents for these two functions: + + static inline int audit_del_rule(struct audit_rule *rule, + struct list_head *list) + { + struct audit_entry *e; + + /* Do not use the _rcu iterator here, since this is the only + * deletion routine. */ + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + list_del_rcu(&e->list); + call_rcu(&e->rcu, audit_free_rule, e); + return 0; + } + } + return -EFAULT; /* No matching rule */ + } + + static inline int audit_add_rule(struct audit_entry *entry, + struct list_head *list) + { + if (entry->rule.flags & AUDIT_PREPEND) { + entry->rule.flags &= ~AUDIT_PREPEND; + list_add_rcu(&entry->list, list); + } else { + list_add_tail_rcu(&entry->list, list); + } + return 0; + } + +Normally, the write_lock() and write_unlock() would be replaced by +a spin_lock() and a spin_unlock(), but in this case, all callers hold +audit_netlink_sem, so no additional locking is required. The auditsc_lock +can therefore be eliminated, since use of RCU eliminates the need for +writers to exclude readers. 
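
The publish-then-traverse pattern behind the RCU versions of
audit_add_rule() and audit_filter_task() can be sketched in user space
with C11 atomics. This is only a rough analogy under stated assumptions,
not the kernel's implementation: the toy_entry structure and the
toy_add() and toy_lookup() functions are hypothetical names invented for
this sketch, the list is singly linked, updaters are assumed to be
serialized by the caller (as audit_netlink_sem does above), and nothing
is ever freed, so no grace period is modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for one RCU-protected list entry. */
struct toy_entry {
	int rule;
	struct toy_entry *_Atomic next;
};

/* Hypothetical list head; every reader starts its traversal here. */
static struct toy_entry *_Atomic toy_head;

/*
 * Updater side, loosely modeled on list_add_rcu(): initialize the new
 * entry completely, then publish it with a release store so that a
 * reader that sees the pointer also sees the initialized fields.
 * Assumes the caller already excludes other updaters.
 */
int toy_add(int rule)
{
	struct toy_entry *e = malloc(sizeof(*e));

	if (e == NULL)
		return -1;
	e->rule = rule;
	atomic_init(&e->next,
		    atomic_load_explicit(&toy_head, memory_order_relaxed));
	atomic_store_explicit(&toy_head, e, memory_order_release);

	/* With updaters serialized, the new entry must now be first. */
	assert(atomic_load_explicit(&toy_head, memory_order_relaxed) == e);
	return 0;
}

/*
 * Reader side, loosely modeled on list_for_each_entry_rcu(): no lock
 * is taken; each pointer is picked up with an acquire load before the
 * entry it points to is examined.  Returns 1 if the rule is present.
 */
int toy_lookup(int rule)
{
	struct toy_entry *e;

	e = atomic_load_explicit(&toy_head, memory_order_acquire);
	while (e != NULL) {
		if (e->rule == rule)
			return 1;
		e = atomic_load_explicit(&e->next, memory_order_acquire);
	}
	return 0;
}
```

The release store plays roughly the role that the memory barriers in the
_rcu() list-manipulation primitives play in the kernel, and the acquire
loads are a conservative substitute for the read-side ordering that the
_rcu() traversal primitives provide.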
+ +The list_del(), list_add(), and list_add_tail() primitives have been +replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu(). +The _rcu() list-manipulation primitives add memory barriers that are +needed on weakly ordered CPUs (most of them!). + +So, when readers can tolerate stale data and when entries are either added +or deleted, without in-place modification, it is very easy to use RCU! + + +Example 2: Handling In-Place Updates + +The system-call auditing code does not update auditing rules in place. +However, if it did, reader-writer-locked code to do so might look as +follows (presumably, the field_count is only permitted to decrease; +otherwise, the added fields would need to be filled in): + + static inline int audit_upd_rule(struct audit_rule *rule, + struct list_head *list, + __u32 newaction, + __u32 newfield_count) + { + struct audit_entry *e; + + write_lock(&auditsc_lock); + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + e->rule.action = newaction; + e->rule.field_count = newfield_count; + write_unlock(&auditsc_lock); + return 0; + } + } + write_unlock(&auditsc_lock); + return -EFAULT; /* No matching rule */ + } + +The RCU version creates a copy, updates the copy, then replaces the old +entry with the newly updated entry. This sequence of actions, allowing +concurrent reads while doing a copy to perform an update, is what gives +RCU ("read-copy update") its name.
The RCU code is as follows: + + static inline int audit_upd_rule(struct audit_rule *rule, + struct list_head *list, + __u32 newaction, + __u32 newfield_count) + { + struct audit_entry *e; + struct audit_entry *ne; + + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + ne = kmalloc(sizeof(*ne), GFP_ATOMIC); + if (ne == NULL) + return -ENOMEM; + audit_copy_rule(&ne->rule, &e->rule); + ne->rule.action = newaction; + ne->rule.field_count = newfield_count; + list_add_rcu(&ne->list, &e->list); + list_del_rcu(&e->list); + call_rcu(&e->rcu, audit_free_rule, e); + return 0; + } + } + return -EFAULT; /* No matching rule */ + } + +Again, this assumes that the caller holds audit_netlink_sem. Normally, +the reader-writer lock would become a spinlock in this sort of code. + + +Example 3: Eliminating Stale Data + +The auditing examples above tolerate stale data, as do most algorithms +that are tracking external state. Because there is a delay between the +time the external state changes and the time Linux becomes aware of the +change, additional RCU-induced staleness is normally not a problem. + +However, there are many examples where stale data cannot be tolerated. +One example in the Linux kernel is the System V IPC (see the ipc_lock() +function in ipc/util.c). This code checks a "deleted" flag under a +per-entry spinlock, and, if the "deleted" flag is set, pretends that the +entry does not exist. For this to be helpful, the search function must +return holding the per-entry spinlock, as ipc_lock() does in fact do. + +Quick Quiz: Why does the search function need to return holding the +per-entry lock for this deleted-flag technique to be helpful?
+ +If the system-call audit module were to ever need to reject stale data, +one way to accomplish this would be to add a "deleted" flag and a "lock" +spinlock to the audit_entry structure, and modify audit_filter_task() +as follows: + + static enum audit_state audit_filter_task(struct task_struct *tsk) + { + struct audit_entry *e; + enum audit_state state; + + rcu_read_lock(); + list_for_each_entry_rcu(e, &audit_tsklist, list) { + if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { + spin_lock(&e->lock); + if (e->deleted) { + spin_unlock(&e->lock); + rcu_read_unlock(); + return AUDIT_BUILD_CONTEXT; + } + rcu_read_unlock(); + return state; + } + } + rcu_read_unlock(); + return AUDIT_BUILD_CONTEXT; + } + +Note that this example assumes that entries are only added and deleted. +Additional mechanism is required to deal correctly with the +update-in-place performed by audit_upd_rule(). For one thing, +audit_upd_rule() would need additional memory barriers to ensure +that the list_add_rcu() was really executed before the list_del_rcu(). + +The audit_del_rule() function would need to set the "deleted" +flag under the spinlock as follows: + + static inline int audit_del_rule(struct audit_rule *rule, + struct list_head *list) + { + struct audit_entry *e; + + /* Do not use the _rcu iterator here, since this is the only + * deletion routine. */ + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + spin_lock(&e->lock); + list_del_rcu(&e->list); + e->deleted = 1; + spin_unlock(&e->lock); + call_rcu(&e->rcu, audit_free_rule, e); + return 0; + } + } + return -EFAULT; /* No matching rule */ + } + + +Summary + +Read-mostly list-based data structures that can tolerate stale data are +the most amenable to use of RCU. 
The simplest case is where entries are +either added or deleted from the data structure (or atomically modified +in place), but non-atomic in-place modifications can be handled by making +a copy, updating the copy, then replacing the original with the copy. +If stale data cannot be tolerated, then a "deleted" flag may be used +in conjunction with a per-entry spinlock in order to allow the search +function to reject newly deleted data. + + +Answer to Quick Quiz + +If the search function drops the per-entry lock before returning, then +the caller will be processing stale data in any case. If it is really +OK to be processing stale data, then you don't need a "deleted" flag. +If processing stale data really is a problem, then you need to hold the +per-entry lock across all of the code that uses the value looked up. diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt new file mode 100644 index 000000000000..7e0c2ab6f2bd --- /dev/null +++ b/Documentation/RCU/rcu.txt @@ -0,0 +1,67 @@ +RCU Concepts + + +The basic idea behind RCU (read-copy update) is to split destructive +operations into two parts, one that prevents anyone from seeing the data +item being destroyed, and one that actually carries out the destruction. +A "grace period" must elapse between the two parts, and this grace period +must be long enough that any readers accessing the item being deleted have +since dropped their references. For example, an RCU-protected deletion +from a linked list would first remove the item from the list, wait for +a grace period to elapse, then free the element. See the listRCU.txt +file for more information on using RCU with linked lists. + + +Frequently Asked Questions + +o Why would anyone want to use RCU? + + The advantage of RCU's two-part approach is that RCU readers need + not acquire any locks, perform any atomic instructions, write to + shared memory, or (on CPUs other than Alpha) execute any memory + barriers. 
The fact that these operations are quite expensive + on modern CPUs is what gives RCU its performance advantages + in read-mostly situations. The fact that RCU readers need not + acquire locks can also greatly simplify deadlock-avoidance code. + +o How can the updater tell when a grace period has completed + if the RCU readers give no indication when they are done? + + Just as with spinlocks, RCU readers are not permitted to + block, switch to user-mode execution, or enter the idle loop. + Therefore, as soon as a CPU is seen passing through any of these + three states, we know that that CPU has exited any previous RCU + read-side critical sections. So, if we remove an item from a + linked list, and then wait until all CPUs have switched context, + executed in user mode, or executed in the idle loop, we can + safely free up that item. + +o If I am running on a uniprocessor kernel, which can only do one + thing at a time, why should I wait for a grace period? + + See the UP.txt file in this directory. + +o How can I see where RCU is currently used in the Linux kernel? + + Search for "rcu_read_lock", "call_rcu", and "synchronize_kernel". + +o What guidelines should I follow when writing code that uses RCU? + + See the checklist.txt file in this directory. + +o Why the name "RCU"? + + "RCU" stands for "read-copy update". The file listRCU.txt has + more information on where this name came from, search for + "read-copy update" to find it. + +o I hear that RCU is patented? What is with that? + + Yes, it is. There are several known patents related to RCU, + search for the string "Patent" in RTFP.txt to find them. + Of these, one was allowed to lapse by the assignee, and the + others have been contributed to the Linux kernel under GPL. + +o Where can I find more information on RCU? + + See the RTFP.txt file in this directory. 
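
As a rough illustration of the two-part approach described under
"RCU Concepts" above, the following user-space sketch queues a
destruction callback and runs it only once the caller declares that a
grace period has elapsed. Everything here is an assumption made for the
sake of the example: the toy_rcu_head structure and the toy_call_rcu(),
toy_grace_period_elapsed(), and toy_item_reclaim() functions are
hypothetical names, the sketch is single-threaded, and it does not
detect grace periods at all, which is the hard part that the real RCU
infrastructure handles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback record embedded in each protected structure. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *head);
};

/* Callbacks queued since the last (simulated) grace period. */
static struct toy_rcu_head *toy_pending;

/*
 * Part one has already happened when this is called: the item has been
 * made unreachable to new readers.  This function only queues part two
 * (the destruction); it must never invoke func immediately.
 */
void toy_call_rcu(struct toy_rcu_head *head,
		  void (*func)(struct toy_rcu_head *head))
{
	assert(func != NULL);
	head->func = func;
	head->next = toy_pending;
	toy_pending = head;
}

/*
 * Stand-in for the end of a grace period: the caller asserts that all
 * pre-existing readers are done, so the queued callbacks may now run.
 * Returns the number of callbacks invoked.
 */
int toy_grace_period_elapsed(void)
{
	struct toy_rcu_head *head = toy_pending;
	int invoked = 0;

	toy_pending = NULL;
	while (head != NULL) {
		/* Save next first: the callback may destroy head. */
		struct toy_rcu_head *next = head->next;

		head->func(head);
		invoked++;
		head = next;
	}
	return invoked;
}

/* Hypothetical protected item; "reclaimed" stands in for being freed. */
struct toy_item {
	int reclaimed;
	struct toy_rcu_head rcu;
};

/* Destruction callback: recover the enclosing item and mark it. */
void toy_item_reclaim(struct toy_rcu_head *head)
{
	struct toy_item *item = (struct toy_item *)
		((char *)head - offsetof(struct toy_item, rcu));

	item->reclaimed = 1;
}
```

Queueing an item with toy_call_rcu() leaves it untouched until
toy_grace_period_elapsed() is called, which mirrors the rule explained
in UP.txt that call_rcu() must not invoke its function immediately,
even on a uniprocessor.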
diff --git a/Documentation/README.DAC960 b/Documentation/README.DAC960 new file mode 100644 index 000000000000..98ea617a0dd6 --- /dev/null +++ b/Documentation/README.DAC960 @@ -0,0 +1,756 @@ + Linux Driver for Mylex DAC960/AcceleRAID/eXtremeRAID PCI RAID Controllers + + Version 2.2.11 for Linux 2.2.19 + Version 2.4.11 for Linux 2.4.12 + + PRODUCTION RELEASE + + 11 October 2001 + + Leonard N. Zubkoff + Dandelion Digital + lnz@dandelion.com + + Copyright 1998-2001 by Leonard N. Zubkoff + + + INTRODUCTION + +Mylex, Inc. designs and manufactures a variety of high performance PCI RAID +controllers. Mylex Corporation is located at 34551 Ardenwood Blvd., Fremont, +California 94555, USA and can be reached at 510.796.6100 or on the World Wide +Web at http://www.mylex.com. Mylex Technical Support can be reached by +electronic mail at mylexsup@us.ibm.com, by voice at 510.608.2400, or by FAX at +510.745.7715. Contact information for offices in Europe and Japan is available +on their Web site. + +The latest information on Linux support for DAC960 PCI RAID Controllers, as +well as the most recent release of this driver, will always be available from +my Linux Home Page at URL "http://www.dandelion.com/Linux/". The Linux DAC960 +driver supports all current Mylex PCI RAID controllers including the new +eXtremeRAID 2000/3000 and AcceleRAID 352/170/160 models which have an entirely +new firmware interface from the older eXtremeRAID 1100, AcceleRAID 150/200/250, +and DAC960PJ/PG/PU/PD/PL. See below for a complete controller list as well as +minimum firmware version requirements. For simplicity, in most places this +documentation refers to DAC960 generically rather than explicitly listing all +the supported models. + +Driver bug reports should be sent via electronic mail to "lnz@dandelion.com". 
+Please include with the bug report the complete configuration messages reported +by the driver at startup, along with any subsequent system messages relevant to +the controller's operation, and a detailed description of your system's +hardware configuration. Driver bugs are actually quite rare; if you encounter +problems with disks being marked offline, for example, please contact Mylex +Technical Support as the problem is related to the hardware configuration +rather than the Linux driver. + +Please consult the RAID controller documentation for detailed information +regarding installation and configuration of the controllers. This document +primarily provides information specific to the Linux support. + + + DRIVER FEATURES + +The DAC960 RAID controllers are supported solely as high performance RAID +controllers, not as interfaces to arbitrary SCSI devices. The Linux DAC960 +driver operates at the block device level, the same level as the SCSI and IDE +drivers. Unlike other RAID controllers currently supported on Linux, the +DAC960 driver is not dependent on the SCSI subsystem, and hence avoids all the +complexity and unnecessary code that would be associated with an implementation +as a SCSI driver. The DAC960 driver is designed for as high a performance as +possible with no compromises or extra code for compatibility with lower +performance devices. The DAC960 driver includes extensive error logging and +online configuration management capabilities. Except for initial configuration +of the controller and adding new disk drives, most everything can be handled +from Linux while the system is operational. + +The DAC960 driver is architected to support up to 8 controllers per system. +Each DAC960 parallel SCSI controller can support up to 15 disk drives per +channel, for a maximum of 60 drives on a four channel controller; the fibre +channel eXtremeRAID 3000 controller supports up to 125 disk drives per loop for +a total of 250 drives. 
The drives installed on a controller are divided into +one or more "Drive Groups", and then each Drive Group is subdivided further +into 1 to 32 "Logical Drives". Each Logical Drive has a specific RAID Level +and caching policy associated with it, and it appears to Linux as a single +block device. Logical Drives are further subdivided into up to 7 partitions +through the normal Linux and PC disk partitioning schemes. Logical Drives are +also known as "System Drives", and Drive Groups are also called "Packs". Both +terms are in use in the Mylex documentation; I have chosen to standardize on +the more generic "Logical Drive" and "Drive Group". + +DAC960 RAID disk devices are named in the style of the Device File System +(DEVFS). The device corresponding to Logical Drive D on Controller C is +referred to as /dev/rd/cCdD, and the partitions are called /dev/rd/cCdDp1 +through /dev/rd/cCdDp7. For example, partition 3 of Logical Drive 5 on +Controller 2 is referred to as /dev/rd/c2d5p3. Note that unlike with SCSI +disks the device names will not change in the event of a disk drive failure. +The DAC960 driver is assigned major numbers 48 - 55 with one major number per +controller. The 8 bits of minor number are divided into 5 bits for the Logical +Drive and 3 bits for the partition. + + + SUPPORTED DAC960/AcceleRAID/eXtremeRAID PCI RAID CONTROLLERS + +The following list comprises the supported DAC960, AcceleRAID, and eXtremeRAID +PCI RAID Controllers as of the date of this document. It is recommended that +anyone purchasing a Mylex PCI RAID Controller not in the following table +contact the author beforehand to verify that it is or will be supported. 
+ +eXtremeRAID 3000 + 1 Wide Ultra-2/LVD SCSI channel + 2 External Fibre FC-AL channels + 233MHz StrongARM SA 110 Processor + 64 Bit 33MHz PCI (backward compatible with 32 Bit PCI slots) + 32MB/64MB ECC SDRAM Memory + +eXtremeRAID 2000 + 4 Wide Ultra-160 LVD SCSI channels + 233MHz StrongARM SA 110 Processor + 64 Bit 33MHz PCI (backward compatible with 32 Bit PCI slots) + 32MB/64MB ECC SDRAM Memory + +AcceleRAID 352 + 2 Wide Ultra-160 LVD SCSI channels + 100MHz Intel i960RN RISC Processor + 64 Bit 33MHz PCI (backward compatible with 32 Bit PCI slots) + 32MB/64MB ECC SDRAM Memory + +AcceleRAID 170 + 1 Wide Ultra-160 LVD SCSI channel + 100MHz Intel i960RM RISC Processor + 16MB/32MB/64MB ECC SDRAM Memory + +AcceleRAID 160 (AcceleRAID 170LP) + 1 Wide Ultra-160 LVD SCSI channel + 100MHz Intel i960RS RISC Processor + Built in 16M ECC SDRAM Memory + PCI Low Profile Form Factor - fit for 2U height + +eXtremeRAID 1100 (DAC1164P) + 3 Wide Ultra-2/LVD SCSI channels + 233MHz StrongARM SA 110 Processor + 64 Bit 33MHz PCI (backward compatible with 32 Bit PCI slots) + 16MB/32MB/64MB Parity SDRAM Memory with Battery Backup + +AcceleRAID 250 (DAC960PTL1) + Uses onboard Symbios SCSI chips on certain motherboards + Also includes one onboard Wide Ultra-2/LVD SCSI Channel + 66MHz Intel i960RD RISC Processor + 4MB/8MB/16MB/32MB/64MB/128MB ECC EDO Memory + +AcceleRAID 200 (DAC960PTL0) + Uses onboard Symbios SCSI chips on certain motherboards + Includes no onboard SCSI Channels + 66MHz Intel i960RD RISC Processor + 4MB/8MB/16MB/32MB/64MB/128MB ECC EDO Memory + +AcceleRAID 150 (DAC960PRL) + Uses onboard Symbios SCSI chips on certain motherboards + Also includes one onboard Wide Ultra-2/LVD SCSI Channel + 33MHz Intel i960RP RISC Processor + 4MB Parity EDO Memory + +DAC960PJ 1/2/3 Wide Ultra SCSI-3 Channels + 66MHz Intel i960RD RISC Processor + 4MB/8MB/16MB/32MB/64MB/128MB ECC EDO Memory + +DAC960PG 1/2/3 Wide Ultra SCSI-3 Channels + 33MHz Intel i960RP RISC Processor + 4MB/8MB ECC EDO Memory 
+ +DAC960PU 1/2/3 Wide Ultra SCSI-3 Channels + Intel i960CF RISC Processor + 4MB/8MB EDRAM or 2MB/4MB/8MB/16MB/32MB DRAM Memory + +DAC960PD 1/2/3 Wide Fast SCSI-2 Channels + Intel i960CF RISC Processor + 4MB/8MB EDRAM or 2MB/4MB/8MB/16MB/32MB DRAM Memory + +DAC960PL 1/2/3 Wide Fast SCSI-2 Channels + Intel i960 RISC Processor + 2MB/4MB/8MB/16MB/32MB DRAM Memory + +DAC960P 1/2/3 Wide Fast SCSI-2 Channels + Intel i960 RISC Processor + 2MB/4MB/8MB/16MB/32MB DRAM Memory + +For the eXtremeRAID 2000/3000 and AcceleRAID 352/170/160, firmware version +6.00-01 or above is required. + +For the eXtremeRAID 1100, firmware version 5.06-0-52 or above is required. + +For the AcceleRAID 250, 200, and 150, firmware version 4.06-0-57 or above is +required. + +For the DAC960PJ and DAC960PG, firmware version 4.06-0-00 or above is required. + +For the DAC960PU, DAC960PD, DAC960PL, and DAC960P, either firmware version +3.51-0-04 or above is required (for dual Flash ROM controllers), or firmware +version 2.73-0-00 or above is required (for single Flash ROM controllers). + +Please note that not all SCSI disk drives are suitable for use with DAC960 +controllers, and only particular firmware versions of any given model may +actually function correctly. Similarly, not all motherboards have a BIOS that +properly initializes the AcceleRAID 250, AcceleRAID 200, AcceleRAID 150, +DAC960PJ, and DAC960PG because the Intel i960RD/RP is a multi-function device. +If in doubt, contact Mylex RAID Technical Support (mylexsup@us.ibm.com) to +verify compatibility. Mylex makes available a hard disk compatibility list at +http://www.mylex.com/support/hdcomp/hd-lists.html. + + + DRIVER INSTALLATION + +This distribution was prepared for Linux kernel version 2.2.19 or 2.4.12. 
+ +To install the DAC960 RAID driver, you may use the following commands, +replacing "/usr/src" with wherever you keep your Linux kernel source tree: + + cd /usr/src + tar -xvzf DAC960-2.2.11.tar.gz (or DAC960-2.4.11.tar.gz) + mv README.DAC960 linux/Documentation + mv DAC960.[ch] linux/drivers/block + patch -p0 < DAC960.patch (if DAC960.patch is included) + cd linux + make config + make bzImage (or zImage) + +Then install "arch/i386/boot/bzImage" or "arch/i386/boot/zImage" as your +standard kernel, run lilo if appropriate, and reboot. + +To create the necessary devices in /dev, the "make_rd" script included in +"DAC960-Utilities.tar.gz" from http://www.dandelion.com/Linux/ may be used. +LILO 21 and FDISK v2.9 include DAC960 support; also included in this archive +are patches to LILO 20 and FDISK v2.8 that add DAC960 support, along with +statically linked executables of LILO and FDISK. This modified version of LILO +will allow booting from a DAC960 controller and/or mounting the root file +system from a DAC960. + +Red Hat Linux 6.0 and SuSE Linux 6.1 include support for Mylex PCI RAID +controllers. Installing directly onto a DAC960 may be problematic from other +Linux distributions until their installation utilities are updated. + + + INSTALLATION NOTES + +Before installing Linux or adding DAC960 logical drives to an existing Linux +system, the controller must first be configured to provide one or more logical +drives using the BIOS Configuration Utility or DACCF. Please note that since +there are only at most 6 usable partitions on each logical drive, systems +requiring more partitions should subdivide a drive group into multiple logical +drives, each of which can have up to 6 usable partitions. 
Also, note that with +large disk arrays it is advisable to enable the 8GB BIOS Geometry (255/63) +rather than accepting the default 2GB BIOS Geometry (128/32); failing to do so +will cause the logical drive geometry to have more than 65535 cylinders, which +will make it impossible for FDISK to be used properly. The 8GB BIOS Geometry +can be enabled by configuring the DAC960 BIOS, which is accessible via Alt-M +during the BIOS initialization sequence. + +For maximum performance and the most efficient E2FSCK operation, it is +recommended that EXT2 file systems be built with a 4KB block size and a 16 block +stride (64KB / 4KB = 16) to match the DAC960 controller's 64KB default stripe size. The command +"mke2fs -b 4096 -R stride=16 " is appropriate. Unless there will be a +large number of small files on the file systems, it is also beneficial to add +the "-i 16384" option to increase the bytes per inode parameter, thereby +reducing the file system metadata. Finally, on systems that will only be run +with Linux 2.2 or later kernels it is beneficial to enable sparse superblocks +with the "-s 1" option. + + + DAC960 ANNOUNCEMENTS MAILING LIST + +The DAC960 Announcements Mailing List provides a forum for informing Linux +users of new driver releases and other announcements regarding Linux support +for DAC960 PCI RAID Controllers. To join the mailing list, send a message to +"dac960-announce-request@dandelion.com" with the line "subscribe" in the +message body. + + + CONTROLLER CONFIGURATION AND STATUS MONITORING + +The DAC960 RAID controllers running firmware 4.06 or above include a Background +Initialization facility so that system downtime is minimized both for initial +installation and subsequent configuration of additional storage. 
The BIOS +Configuration Utility (accessible via Alt-R during the BIOS initialization +sequence) is used to quickly configure the controller, and then the logical +drives that have been created are available for immediate use even while they +are still being initialized by the controller. The primary need for online +configuration and status monitoring is then to avoid system downtime when disk +drives fail and must be replaced. Mylex's online monitoring and configuration +utilities are being ported to Linux and will become available at some point in +the future. Note that with a SAF-TE (SCSI Accessed Fault-Tolerant Enclosure) +enclosure, the controller is able to rebuild failed drives automatically as +soon as a drive replacement is made available. + +The primary interfaces for controller configuration and status monitoring are +special files created in the /proc/rd/... hierarchy along with the normal +system console logging mechanism. Whenever the system is operating, the DAC960 +driver queries each controller for status information every 10 seconds, and +checks for additional conditions every 60 seconds. The initial status of each +controller is always available for controller N in /proc/rd/cN/initial_status, +and the current status as of the last status monitoring query is available in +/proc/rd/cN/current_status. In addition, status changes are also logged by the +driver to the system console and will appear in the log files maintained by +syslog. The progress of asynchronous rebuild or consistency check operations +is also available in /proc/rd/cN/current_status, and progress messages are +logged to the system console at most every 60 seconds. 
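The status files described above lend themselves to simple scripting. The following is a minimal sketch, not part of the driver distribution, that prints the logical drive lines and the rebuild or consistency check line from each controller's current_status file. The function name dac960_summary and its optional directory argument are hypothetical, and the match patterns assume the status layout shown in the sample output elsewhere in this README.

```shell
#!/bin/sh
# Sketch only: summarize DAC960 status from the /proc/rd files described above.
# rd_root is a hypothetical parameter so the function can be pointed at a test
# directory; on a real system it defaults to /proc/rd.
dac960_summary() {
    rd_root="${1:-/proc/rd}"
    for status_file in "$rd_root"/c*/current_status; do
        # Skip the unexpanded glob when no controller directories exist.
        [ -r "$status_file" ] || continue
        controller=$(basename "$(dirname "$status_file")")
        # Keep only logical drive lines and the rebuild/consistency check line,
        # strip their indentation, and prefix each with the controller name.
        grep -E '/dev/rd/|Rebuild|Consistency Check' "$status_file" |
            sed -e 's/^[[:space:]]*//' -e "s|^|$controller: |"
    done
}
```

Calling dac960_summary with no argument reads /proc/rd. Since the driver refreshes current_status about every 10 seconds, polling more often than that should not yield newer information.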
+ +Starting with the 2.2.3/2.0.3 versions of the driver, the status information +available in /proc/rd/cN/initial_status and /proc/rd/cN/current_status has been +augmented to include the vendor, model, revision, and serial number (if +available) for each physical device found connected to the controller: + +***** DAC960 RAID Driver Version 2.2.3 of 19 August 1999 ***** +Copyright 1998-1999 by Leonard N. Zubkoff +Configuring Mylex DAC960PRL PCI RAID Controller + Firmware Version: 4.07-0-07, Channels: 1, Memory Size: 16MB + PCI Bus: 1, Device: 4, Function: 1, I/O Address: Unassigned + PCI Address: 0xFE300000 mapped at 0xA0800000, IRQ Channel: 21 + Controller Queue Depth: 128, Maximum Blocks per Command: 128 + Driver Queue Depth: 127, Maximum Scatter/Gather Segments: 33 + Stripe Size: 64KB, Segment Size: 8KB, BIOS Geometry: 255/63 + SAF-TE Enclosure Management Enabled + Physical Devices: + 0:0 Vendor: IBM Model: DRVS09D Revision: 0270 + Serial Number: 68016775HA + Disk Status: Online, 17928192 blocks + 0:1 Vendor: IBM Model: DRVS09D Revision: 0270 + Serial Number: 68004E53HA + Disk Status: Online, 17928192 blocks + 0:2 Vendor: IBM Model: DRVS09D Revision: 0270 + Serial Number: 13013935HA + Disk Status: Online, 17928192 blocks + 0:3 Vendor: IBM Model: DRVS09D Revision: 0270 + Serial Number: 13016897HA + Disk Status: Online, 17928192 blocks + 0:4 Vendor: IBM Model: DRVS09D Revision: 0270 + Serial Number: 68019905HA + Disk Status: Online, 17928192 blocks + 0:5 Vendor: IBM Model: DRVS09D Revision: 0270 + Serial Number: 68012753HA + Disk Status: Online, 17928192 blocks + 0:6 Vendor: ESG-SHV Model: SCA HSBP M6 Revision: 0.61 + Logical Drives: + /dev/rd/c0d0: RAID-5, Online, 89640960 blocks, Write Thru + No Rebuild or Consistency Check in Progress + +To simplify the monitoring process for custom software, the special file +/proc/rd/status returns "OK" when all DAC960 controllers in the system are +operating normally and no failures have occurred, or "ALERT" if any logical 
+drives are offline or critical or any non-standby physical drives are dead. + +Configuration commands for controller N are available via the special file +/proc/rd/cN/user_command. A human readable command can be written to this +special file to initiate a configuration operation, and the results of the +operation can then be read back from the special file in addition to being +logged to the system console. The shell command sequence + + echo "" > /proc/rd/c0/user_command + cat /proc/rd/c0/user_command + +is typically used to execute configuration commands. The configuration +commands are: + + flush-cache + + The "flush-cache" command flushes the controller's cache. The system + automatically flushes the cache at shutdown or if the driver module is + unloaded, so this command is only needed to be certain a write back cache + is flushed to disk before the system is powered off by a command to a UPS. + Note that the flush-cache command also stops an asynchronous rebuild or + consistency check, so it should not be used except when the system is being + halted. + + kill : + + The "kill" command marks the physical drive : as DEAD. + This command is provided primarily for testing, and should not be used + during normal system operation. + + make-online : + + The "make-online" command changes the physical drive : + from status DEAD to status ONLINE. In cases where multiple physical drives + have been killed simultaneously, this command may be used to bring all but + one of them back online, after which a rebuild to the final drive is + necessary. + + Warning: make-online should only be used on a dead physical drive that is + an active part of a drive group, never on a standby drive. The command + should never be used on a dead drive that is part of a critical logical + drive; rebuild should be used if only a single drive is dead. + + make-standby : + + The "make-standby" command changes physical drive : + from status DEAD to status STANDBY. 
It should only be used in cases where + a dead drive was replaced after an automatic rebuild was performed onto a + standby drive. It cannot be used to add a standby drive to the controller + configuration if one was not created initially; the BIOS Configuration + Utility must be used for that currently. + + rebuild : + + The "rebuild" command initiates an asynchronous rebuild onto physical drive + :. It should only be used when a dead drive has been + replaced. + + check-consistency + + The "check-consistency" command initiates an asynchronous consistency check + of with automatic restoration. It can be used + whenever it is desired to verify the consistency of the redundancy + information. + + cancel-rebuild + cancel-consistency-check + + The "cancel-rebuild" and "cancel-consistency-check" commands cancel any + rebuild or consistency check operations previously initiated. + + + EXAMPLE I - DRIVE FAILURE WITHOUT A STANDBY DRIVE + +The following annotated logs demonstrate the controller configuration and +online status monitoring capabilities of the Linux DAC960 Driver. The test +configuration comprises 6 1GB Quantum Atlas I disk drives on two channels of a +DAC960PJ controller. The physical drives are configured into a single drive +group without a standby drive, and the drive group has been configured into two +logical drives, one RAID-5 and one RAID-6. Note that these logs are from an +earlier version of the driver and the messages have changed somewhat with newer +releases, but the functionality remains similar. First, here is the current +status of the RAID configuration: + +gwynedd:/u/lnz# cat /proc/rd/c0/current_status +***** DAC960 RAID Driver Version 2.0.0 of 23 March 1999 ***** +Copyright 1998-1999 by Leonard N. 
Zubkoff +Configuring Mylex DAC960PJ PCI RAID Controller + Firmware Version: 4.06-0-08, Channels: 3, Memory Size: 8MB + PCI Bus: 0, Device: 19, Function: 1, I/O Address: Unassigned + PCI Address: 0xFD4FC000 mapped at 0x8807000, IRQ Channel: 9 + Controller Queue Depth: 128, Maximum Blocks per Command: 128 + Driver Queue Depth: 127, Maximum Scatter/Gather Segments: 33 + Stripe Size: 64KB, Segment Size: 8KB, BIOS Geometry: 255/63 + Physical Devices: + 0:1 - Disk: Online, 2201600 blocks + 0:2 - Disk: Online, 2201600 blocks + 0:3 - Disk: Online, 2201600 blocks + 1:1 - Disk: Online, 2201600 blocks + 1:2 - Disk: Online, 2201600 blocks + 1:3 - Disk: Online, 2201600 blocks + Logical Drives: + /dev/rd/c0d0: RAID-5, Online, 5498880 blocks, Write Thru + /dev/rd/c0d1: RAID-6, Online, 3305472 blocks, Write Thru + No Rebuild or Consistency Check in Progress + +gwynedd:/u/lnz# cat /proc/rd/status +OK + +The above messages indicate that everything is healthy, and /proc/rd/status +returns "OK" indicating that there are no problems with any DAC960 controller +in the system. For demonstration purposes, while I/O is active Physical Drive +1:1 is now disconnected, simulating a drive failure. 
The failure is noted by +the driver within 10 seconds of the controller's having detected it, and the +driver logs the following console status messages indicating that Logical +Drives 0 and 1 are now CRITICAL as a result of Physical Drive 1:1 being DEAD: + +DAC960#0: Physical Drive 1:2 Error Log: Sense Key = 6, ASC = 29, ASCQ = 02 +DAC960#0: Physical Drive 1:3 Error Log: Sense Key = 6, ASC = 29, ASCQ = 02 +DAC960#0: Physical Drive 1:1 killed because of timeout on SCSI command +DAC960#0: Physical Drive 1:1 is now DEAD +DAC960#0: Logical Drive 0 (/dev/rd/c0d0) is now CRITICAL +DAC960#0: Logical Drive 1 (/dev/rd/c0d1) is now CRITICAL + +The Sense Keys logged here are just Check Condition / Unit Attention conditions +arising from a SCSI bus reset that is forced by the controller during its error +recovery procedures. Concurrently with the above, the driver status available +from /proc/rd also reflects the drive failure. The status message in +/proc/rd/status has changed from "OK" to "ALERT": + +gwynedd:/u/lnz# cat /proc/rd/status +ALERT + +and /proc/rd/c0/current_status has been updated: + +gwynedd:/u/lnz# cat /proc/rd/c0/current_status + ... + Physical Devices: + 0:1 - Disk: Online, 2201600 blocks + 0:2 - Disk: Online, 2201600 blocks + 0:3 - Disk: Online, 2201600 blocks + 1:1 - Disk: Dead, 2201600 blocks + 1:2 - Disk: Online, 2201600 blocks + 1:3 - Disk: Online, 2201600 blocks + Logical Drives: + /dev/rd/c0d0: RAID-5, Critical, 5498880 blocks, Write Thru + /dev/rd/c0d1: RAID-6, Critical, 3305472 blocks, Write Thru + No Rebuild or Consistency Check in Progress + +Since there are no standby drives configured, the system can continue to access +the logical drives in a performance degraded mode until the failed drive is +replaced and a rebuild operation completed to restore the redundancy of the +logical drives. 
Once Physical Drive 1:1 is replaced with a properly +functioning drive, or if the physical drive was killed without having failed +(e.g., due to electrical problems on the SCSI bus), the user can instruct the +controller to initiate a rebuild operation onto the newly replaced drive: + +gwynedd:/u/lnz# echo "rebuild 1:1" > /proc/rd/c0/user_command +gwynedd:/u/lnz# cat /proc/rd/c0/user_command +Rebuild of Physical Drive 1:1 Initiated + +The echo command instructs the controller to initiate an asynchronous rebuild +operation onto Physical Drive 1:1, and the status message that results from the +operation is then available for reading from /proc/rd/c0/user_command, as well +as being logged to the console by the driver. + +Within 10 seconds of this command the driver logs the initiation of the +asynchronous rebuild operation: + +DAC960#0: Rebuild of Physical Drive 1:1 Initiated +DAC960#0: Physical Drive 1:1 Error Log: Sense Key = 6, ASC = 29, ASCQ = 01 +DAC960#0: Physical Drive 1:1 is now WRITE-ONLY +DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 1% completed + +and /proc/rd/c0/current_status is updated: + +gwynedd:/u/lnz# cat /proc/rd/c0/current_status + ... + Physical Devices: + 0:1 - Disk: Online, 2201600 blocks + 0:2 - Disk: Online, 2201600 blocks + 0:3 - Disk: Online, 2201600 blocks + 1:1 - Disk: Write-Only, 2201600 blocks + 1:2 - Disk: Online, 2201600 blocks + 1:3 - Disk: Online, 2201600 blocks + Logical Drives: + /dev/rd/c0d0: RAID-5, Critical, 5498880 blocks, Write Thru + /dev/rd/c0d1: RAID-6, Critical, 3305472 blocks, Write Thru + Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 6% completed + +As the rebuild progresses, the current status in /proc/rd/c0/current_status is +updated every 10 seconds: + +gwynedd:/u/lnz# cat /proc/rd/c0/current_status + ... 
+ Physical Devices: + 0:1 - Disk: Online, 2201600 blocks + 0:2 - Disk: Online, 2201600 blocks + 0:3 - Disk: Online, 2201600 blocks + 1:1 - Disk: Write-Only, 2201600 blocks + 1:2 - Disk: Online, 2201600 blocks + 1:3 - Disk: Online, 2201600 blocks + Logical Drives: + /dev/rd/c0d0: RAID-5, Critical, 5498880 blocks, Write Thru + /dev/rd/c0d1: RAID-6, Critical, 3305472 blocks, Write Thru + Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 15% completed + +and every minute a progress message is logged to the console by the driver: + +DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 32% completed +DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 63% completed +DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 94% completed +DAC960#0: Rebuild in Progress: Logical Drive 1 (/dev/rd/c0d1) 94% completed + +Finally, the rebuild completes successfully. The driver logs the status of the +logical and physical drives and the rebuild completion: + +DAC960#0: Rebuild Completed Successfully +DAC960#0: Physical Drive 1:1 is now ONLINE +DAC960#0: Logical Drive 0 (/dev/rd/c0d0) is now ONLINE +DAC960#0: Logical Drive 1 (/dev/rd/c0d1) is now ONLINE + +/proc/rd/c0/current_status is updated: + +gwynedd:/u/lnz# cat /proc/rd/c0/current_status + ... 
+ Physical Devices: + 0:1 - Disk: Online, 2201600 blocks + 0:2 - Disk: Online, 2201600 blocks + 0:3 - Disk: Online, 2201600 blocks + 1:1 - Disk: Online, 2201600 blocks + 1:2 - Disk: Online, 2201600 blocks + 1:3 - Disk: Online, 2201600 blocks + Logical Drives: + /dev/rd/c0d0: RAID-5, Online, 5498880 blocks, Write Thru + /dev/rd/c0d1: RAID-6, Online, 3305472 blocks, Write Thru + Rebuild Completed Successfully + +and /proc/rd/status indicates that everything is healthy once again: + +gwynedd:/u/lnz# cat /proc/rd/status +OK + + + EXAMPLE II - DRIVE FAILURE WITH A STANDBY DRIVE + +The following annotated logs demonstrate the controller configuration and +online status monitoring capabilities of the Linux DAC960 Driver. The test +configuration comprises 6 1GB Quantum Atlas I disk drives on two channels of a +DAC960PJ controller. The physical drives are configured into a single drive +group with a standby drive, and the drive group has been configured into two +logical drives, one RAID-5 and one RAID-6. Note that these logs are from an +earlier version of the driver and the messages have changed somewhat with newer +releases, but the functionality remains similar. First, here is the current +status of the RAID configuration: + +gwynedd:/u/lnz# cat /proc/rd/c0/current_status +***** DAC960 RAID Driver Version 2.0.0 of 23 March 1999 ***** +Copyright 1998-1999 by Leonard N. 
Zubkoff +Configuring Mylex DAC960PJ PCI RAID Controller + Firmware Version: 4.06-0-08, Channels: 3, Memory Size: 8MB + PCI Bus: 0, Device: 19, Function: 1, I/O Address: Unassigned + PCI Address: 0xFD4FC000 mapped at 0x8807000, IRQ Channel: 9 + Controller Queue Depth: 128, Maximum Blocks per Command: 128 + Driver Queue Depth: 127, Maximum Scatter/Gather Segments: 33 + Stripe Size: 64KB, Segment Size: 8KB, BIOS Geometry: 255/63 + Physical Devices: + 0:1 - Disk: Online, 2201600 blocks + 0:2 - Disk: Online, 2201600 blocks + 0:3 - Disk: Online, 2201600 blocks + 1:1 - Disk: Online, 2201600 blocks + 1:2 - Disk: Online, 2201600 blocks + 1:3 - Disk: Standby, 2201600 blocks + Logical Drives: + /dev/rd/c0d0: RAID-5, Online, 4399104 blocks, Write Thru + /dev/rd/c0d1: RAID-6, Online, 2754560 blocks, Write Thru + No Rebuild or Consistency Check in Progress + +gwynedd:/u/lnz# cat /proc/rd/status +OK + +The above messages indicate that everything is healthy, and /proc/rd/status +returns "OK" indicating that there are no problems with any DAC960 controller +in the system. For demonstration purposes, while I/O is active Physical Drive +1:2 is now disconnected, simulating a drive failure. 
The failure is noted by +the driver within 10 seconds of the controller's having detected it, and the +driver logs the following console status messages: + +DAC960#0: Physical Drive 1:1 Error Log: Sense Key = 6, ASC = 29, ASCQ = 02 +DAC960#0: Physical Drive 1:3 Error Log: Sense Key = 6, ASC = 29, ASCQ = 02 +DAC960#0: Physical Drive 1:2 killed because of timeout on SCSI command +DAC960#0: Physical Drive 1:2 is now DEAD +DAC960#0: Physical Drive 1:2 killed because it was removed +DAC960#0: Logical Drive 0 (/dev/rd/c0d0) is now CRITICAL +DAC960#0: Logical Drive 1 (/dev/rd/c0d1) is now CRITICAL + +Since a standby drive is configured, the controller automatically begins +rebuilding onto the standby drive: + +DAC960#0: Physical Drive 1:3 is now WRITE-ONLY +DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 4% completed + +Concurrently with the above, the driver status available from /proc/rd also +reflects the drive failure and automatic rebuild. The status message in +/proc/rd/status has changed from "OK" to "ALERT": + +gwynedd:/u/lnz# cat /proc/rd/status +ALERT + +and /proc/rd/c0/current_status has been updated: + +gwynedd:/u/lnz# cat /proc/rd/c0/current_status + ... + Physical Devices: + 0:1 - Disk: Online, 2201600 blocks + 0:2 - Disk: Online, 2201600 blocks + 0:3 - Disk: Online, 2201600 blocks + 1:1 - Disk: Online, 2201600 blocks + 1:2 - Disk: Dead, 2201600 blocks + 1:3 - Disk: Write-Only, 2201600 blocks + Logical Drives: + /dev/rd/c0d0: RAID-5, Critical, 4399104 blocks, Write Thru + /dev/rd/c0d1: RAID-6, Critical, 2754560 blocks, Write Thru + Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 4% completed + +As the rebuild progresses, the current status in /proc/rd/c0/current_status is +updated every 10 seconds: + +gwynedd:/u/lnz# cat /proc/rd/c0/current_status + ... 
+ Physical Devices: + 0:1 - Disk: Online, 2201600 blocks + 0:2 - Disk: Online, 2201600 blocks + 0:3 - Disk: Online, 2201600 blocks + 1:1 - Disk: Online, 2201600 blocks + 1:2 - Disk: Dead, 2201600 blocks + 1:3 - Disk: Write-Only, 2201600 blocks + Logical Drives: + /dev/rd/c0d0: RAID-5, Critical, 4399104 blocks, Write Thru + /dev/rd/c0d1: RAID-6, Critical, 2754560 blocks, Write Thru + Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 40% completed + +and every minute a progress message is logged on the console by the driver: + +DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 40% completed +DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 76% completed +DAC960#0: Rebuild in Progress: Logical Drive 1 (/dev/rd/c0d1) 66% completed +DAC960#0: Rebuild in Progress: Logical Drive 1 (/dev/rd/c0d1) 84% completed + +Finally, the rebuild completes successfully. The driver logs the status of the +logical and physical drives and the rebuild completion: + +DAC960#0: Rebuild Completed Successfully +DAC960#0: Physical Drive 1:3 is now ONLINE +DAC960#0: Logical Drive 0 (/dev/rd/c0d0) is now ONLINE +DAC960#0: Logical Drive 1 (/dev/rd/c0d1) is now ONLINE + +/proc/rd/c0/current_status is updated: + +***** DAC960 RAID Driver Version 2.0.0 of 23 March 1999 ***** +Copyright 1998-1999 by Leonard N. 
Zubkoff +Configuring Mylex DAC960PJ PCI RAID Controller + Firmware Version: 4.06-0-08, Channels: 3, Memory Size: 8MB + PCI Bus: 0, Device: 19, Function: 1, I/O Address: Unassigned + PCI Address: 0xFD4FC000 mapped at 0x8807000, IRQ Channel: 9 + Controller Queue Depth: 128, Maximum Blocks per Command: 128 + Driver Queue Depth: 127, Maximum Scatter/Gather Segments: 33 + Stripe Size: 64KB, Segment Size: 8KB, BIOS Geometry: 255/63 + Physical Devices: + 0:1 - Disk: Online, 2201600 blocks + 0:2 - Disk: Online, 2201600 blocks + 0:3 - Disk: Online, 2201600 blocks + 1:1 - Disk: Online, 2201600 blocks + 1:2 - Disk: Dead, 2201600 blocks + 1:3 - Disk: Online, 2201600 blocks + Logical Drives: + /dev/rd/c0d0: RAID-5, Online, 4399104 blocks, Write Thru + /dev/rd/c0d1: RAID-6, Online, 2754560 blocks, Write Thru + Rebuild Completed Successfully + +and /proc/rd/status indicates that everything is healthy once again: + +gwynedd:/u/lnz# cat /proc/rd/status +OK + +Note that the absence of a viable standby drive does not create an "ALERT" +status. Once dead Physical Drive 1:2 has been replaced, the controller must be +told that this has occurred and that the newly replaced drive should become the +new standby drive: + +gwynedd:/u/lnz# echo "make-standby 1:2" > /proc/rd/c0/user_command +gwynedd:/u/lnz# cat /proc/rd/c0/user_command +Make Standby of Physical Drive 1:2 Succeeded + +The echo command instructs the controller to make Physical Drive 1:2 into a +standby drive, and the status message that results from the operation is then +available for reading from /proc/rd/c0/user_command, as well as being logged to +the console by the driver. Within 60 seconds of this command the driver logs: + +DAC960#0: Physical Drive 1:2 Error Log: Sense Key = 6, ASC = 29, ASCQ = 01 +DAC960#0: Physical Drive 1:2 is now STANDBY +DAC960#0: Make Standby of Physical Drive 1:2 Succeeded + +and /proc/rd/c0/current_status is updated: + +gwynedd:/u/lnz# cat /proc/rd/c0/current_status + ... 
+ Physical Devices: + 0:1 - Disk: Online, 2201600 blocks + 0:2 - Disk: Online, 2201600 blocks + 0:3 - Disk: Online, 2201600 blocks + 1:1 - Disk: Online, 2201600 blocks + 1:2 - Disk: Standby, 2201600 blocks + 1:3 - Disk: Online, 2201600 blocks + Logical Drives: + /dev/rd/c0d0: RAID-5, Online, 4399104 blocks, Write Thru + /dev/rd/c0d1: RAID-6, Online, 2754560 blocks, Write Thru + Rebuild Completed Successfully diff --git a/Documentation/README.cycladesZ b/Documentation/README.cycladesZ new file mode 100644 index 000000000000..024a69443cc2 --- /dev/null +++ b/Documentation/README.cycladesZ @@ -0,0 +1,8 @@ + +The Cyclades-Z must have firmware loaded onto the card before it will +operate. This operation should be performed during system startup. + +The firmware, loader program and the latest device driver code are +available from Cyclades at + ftp://ftp.cyclades.com/pub/cyclades/cyclades-z/linux/ + diff --git a/Documentation/SAK.txt b/Documentation/SAK.txt new file mode 100644 index 000000000000..b9019ca872ea --- /dev/null +++ b/Documentation/SAK.txt @@ -0,0 +1,88 @@ +Linux 2.4.2 Secure Attention Key (SAK) handling +18 March 2001, Andrew Morton + +An operating system's Secure Attention Key is a security tool which is +provided as protection against trojan password capturing programs. It +is an undefeatable way of killing all programs which could be +masquerading as login applications. Users need to be taught to enter +this key sequence before they log in to the system. + +From the PC keyboard, Linux has two similar but different ways of +providing SAK. One is the ALT-SYSRQ-K sequence. You shouldn't use +this sequence. It is only available if the kernel was compiled with +sysrq support. + +The proper way of generating a SAK is to define the key sequence using +`loadkeys'. This will work whether or not sysrq support is compiled +into the kernel. + +SAK works correctly when the keyboard is in raw mode. This means that +once defined, SAK will kill a running X server. 
If the system is in +run level 5, the X server will restart. This is what you want to +happen. + +What key sequence should you use? Well, CTRL-ALT-DEL is used to reboot +the machine. CTRL-ALT-BACKSPACE is magical to the X server. We'll +choose CTRL-ALT-PAUSE. + +In your rc.sysinit (or rc.local) file, add the command + + echo "control alt keycode 101 = SAK" | /bin/loadkeys + +And that's it! Only the superuser may reprogram the SAK key. + + +NOTES +===== + +1: Linux SAK is said to be not a "true SAK" as is required by + systems which implement C2 level security. This author does not + know why. + + +2: On the PC keyboard, SAK kills all applications which have + /dev/console opened. + + Unfortunately this includes a number of things which you don't + actually want killed. This is because these applications are + incorrectly holding /dev/console open. Be sure to complain to your + Linux distributor about this! + + You can identify processes which will be killed by SAK with the + command + + # ls -l /proc/[0-9]*/fd/* | grep console + l-wx------ 1 root root 64 Mar 18 00:46 /proc/579/fd/0 -> /dev/console + + Then: + + # ps aux|grep 579 + root 579 0.0 0.1 1088 436 ? S 00:43 0:00 gpm -t ps/2 + + So `gpm' will be killed by SAK. This is a bug in gpm. It should + be closing standard input. You can work around this by finding the + initscript which launches gpm and changing it thusly: + + Old: + + daemon gpm + + New: + + daemon gpm < /dev/null + + Vixie cron also seems to have this problem, and needs the same treatment. + + Also, one prominent Linux distribution has the following three + lines in its rc.sysinit and rc scripts: + + exec 3<&0 + exec 4>&1 + exec 5>&2 + + These commands cause *all* daemons which are launched by the + initscripts to have file descriptors 3, 4 and 5 attached to + /dev/console. So SAK kills them all. A workaround is to simply + delete these lines, but this may cause system management + applications to malfunction - test everything well. 
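The manual check in note 2 above (listing /proc/[0-9]*/fd/* and looking for the console) can also be scripted. The following is a minimal sketch under stated assumptions: the function name sak_victims and its two optional arguments are hypothetical, and it relies on the readlink utility being installed. It resolves each file descriptor symlink and prints the PIDs that hold /dev/console open.

```shell
#!/bin/sh
# Sketch only: list PIDs that hold a given device open, by resolving the
# symlinks under <proc_root>/<pid>/fd/. proc_root and target are hypothetical
# parameters added so the function can be exercised against a test directory.
sak_victims() {
    proc_root="${1:-/proc}"
    target="${2:-/dev/console}"
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # readlink fails on an unexpanded glob or a vanished process; skip those.
        dest=$(readlink "$fd" 2>/dev/null) || continue
        [ "$dest" = "$target" ] || continue
        # Reduce "<proc_root>/<pid>/fd/<n>" to "<pid>".
        pid=${fd#"$proc_root"/}
        echo "${pid%%/*}"
    done | sort -n -u
}
```

Run as root so that the descriptors of other users' processes are readable. Each PID printed can then be looked up with ps, as in the gpm example above.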
+ diff --git a/Documentation/SecurityBugs b/Documentation/SecurityBugs new file mode 100644 index 000000000000..26c3b3635d9f --- /dev/null +++ b/Documentation/SecurityBugs @@ -0,0 +1,38 @@ +Linux kernel developers take security very seriously. As such, we'd +like to know when a security bug is found so that it can be fixed and +disclosed as quickly as possible. Please report security bugs to the +Linux kernel security team. + +1) Contact + +The Linux kernel security team can be contacted by email at +. This is a private list of security officers +who will help verify the bug report and develop and release a fix. +It is possible that the security team will bring in extra help from +area maintainers to understand and fix the security vulnerability. + +As it is with any bug, the more information provided the easier it +will be to diagnose and fix. Please review the procedure outlined in +REPORTING-BUGS if you are unclear about what information is helpful. +Any exploit code is very helpful and will not be released without +consent from the reporter unless it has already been made public. + +2) Disclosure + +The goal of the Linux kernel security team is to work with the +bug submitter on bug resolution as well as disclosure. We prefer +to fully disclose the bug as soon as possible. It is reasonable to +delay disclosure when the bug or the fix is not yet fully understood, +the solution is not well tested, or vendor coordination is needed. However, we +expect these delays to be short, measurable in days, not weeks or months. +A disclosure date is negotiated by the security team working with the +bug submitter as well as vendors. However, the kernel security team +holds the final say when setting a disclosure date. The timeframe for +disclosure is from immediate (esp. if it's already publicly known) +to a few weeks. As a basic default policy, we expect the time from report date to +disclosure date to be on the order of 7 days. 
+ +3) Non-disclosure agreements + +The Linux kernel security team is not a formal body and therefore unable +to enter any non-disclosure agreements. diff --git a/Documentation/SubmittingDrivers b/Documentation/SubmittingDrivers new file mode 100644 index 000000000000..de3b252e717d --- /dev/null +++ b/Documentation/SubmittingDrivers @@ -0,0 +1,145 @@ +Submitting Drivers For The Linux Kernel +--------------------------------------- + +This document is intended to explain how to submit device drivers to the +various kernel trees. Note that if you are interested in video card drivers +you should probably talk to XFree86 (http://www.xfree86.org/) and/or X.Org +(http://x.org/) instead. + +Also read the Documentation/SubmittingPatches document. + + +Allocating Device Numbers +------------------------- + +Major and minor numbers for block and character devices are allocated +by the Linux assigned name and number authority (currently better +known as H Peter Anvin). The site is http://www.lanana.org/. This +also deals with allocating numbers for devices that are not going to +be submitted to the mainstream kernel. + +If you don't use assigned numbers then when your device is submitted it will +be given an assigned number even if that is different from values you may +have shipped to customers before. + +Who To Submit Drivers To +------------------------ + +Linux 2.0: + No new drivers are accepted for this kernel tree + +Linux 2.2: + If the code area has a general maintainer then please submit it to + the maintainer listed in MAINTAINERS in the kernel file. If the + maintainer does not respond or you cannot find the appropriate + maintainer then please contact Alan Cox + +Linux 2.4: + The same rules apply as 2.2. The final contact point for Linux 2.4 + submissions is Marcelo Tosatti . + +Linux 2.6: + The same rules apply as 2.4 except that you should follow linux-kernel + to track changes in APIs. The final contact point for Linux 2.6 + submissions is Andrew Morton .
+ +What Criteria Determine Acceptance +---------------------------------- + +Licensing: The code must be released to us under the + GNU General Public License. We don't insist on any kind + of exclusively GPL licensing, and if you wish the driver + to be useful to other communities such as BSD you may well + wish to release under multiple licenses. + +Copyright: The copyright owner must agree to use of GPL. + It's best if the submitter and copyright owner + are the same person/entity. If not, the name of + the person/entity authorizing use of GPL should be + listed in case it's necessary to verify the will of + the copyright owner. + +Interfaces: If your driver uses existing interfaces and behaves like + other drivers in the same class it will be much more likely + to be accepted than if it invents gratuitous new ones. + If you need to implement a common API over Linux and NT + drivers do it in userspace. + +Code: Please use the Linux style of code formatting as documented + in Documentation/CodingStyle. If you have sections of code + that need to be in other formats, for example because they + are shared with a Windows driver kit and you want to + maintain them just once, separate them out nicely and note + this fact. + +Portability: Pointers are not always 32 bits, not all computers are little + endian, people do not all have floating point and you + shouldn't use inline x86 assembler in your driver without + careful thought. Pure x86 drivers generally are not popular. + If you only have x86 hardware it is hard to test portability + but it is easy to make sure the code can easily be made + portable. + +Clarity: It helps if anyone can see how to fix the driver. It helps + you because you get patches not bug reports. If you submit a + driver that intentionally obfuscates how the hardware works + it will go in the bitbucket.
+ +Control: In general if there is active maintenance of a driver by + the author then patches will be redirected to them unless + they are totally obvious and without need of checking. + If you want to be the contact and update point for the + driver it is a good idea to state this in the comments, + and include an entry in MAINTAINERS for your driver. + +What Criteria Do Not Determine Acceptance +----------------------------------------- + +Vendor: Being the hardware vendor and maintaining the driver is + often a good thing. If there is a stable working driver from + other people already in the tree don't expect 'we are the + vendor' to get your driver chosen. Ideally work with the + existing driver author to build a single perfect driver. + +Author: It doesn't matter if a large Linux company wrote the driver, + or you did. Nobody has any special access to the kernel + tree. Anyone who tells you otherwise isn't telling the + whole story. + + +Resources +--------- + +Linux kernel master tree: + ftp.??.kernel.org:/pub/linux/kernel/... + ?? == your country code, such as "us", "uk", "fr", etc.
+ +Linux kernel mailing list: + linux-kernel@vger.kernel.org + [mail majordomo@vger.kernel.org to subscribe] + +Linux Device Drivers, Third Edition (covers 2.6.10): + http://lwn.net/Kernel/LDD3/ (free version) + +Kernel traffic: + Weekly summary of kernel list activity (much easier to read) + http://www.kerneltraffic.org/kernel-traffic/ + +LWN.net: + Weekly summary of kernel development activity - http://lwn.net/ + 2.6 API changes: + http://lwn.net/Articles/2.6-kernel-api/ + Porting drivers from prior kernels to 2.6: + http://lwn.net/Articles/driver-porting/ + +KernelTrap: + Occasional Linux kernel articles and developer interviews + http://kerneltrap.org/ + +KernelNewbies: + Documentation and assistance for new kernel programmers + http://kernelnewbies.org/ + +Linux USB project: + http://sourceforge.net/projects/linux-usb/ + diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches new file mode 100644 index 000000000000..9838d32b2fe7 --- /dev/null +++ b/Documentation/SubmittingPatches @@ -0,0 +1,374 @@ + + How to Get Your Change Into the Linux Kernel + or + Care And Operation Of Your Linus Torvalds + + + +For a person or company who wishes to submit a change to the Linux +kernel, the process can sometimes be daunting if you're not familiar +with "the system." This text is a collection of suggestions which +can greatly increase the chances of your change being accepted. + +If you are submitting a driver, also read Documentation/SubmittingDrivers. + + + +-------------------------------------------- +SECTION 1 - CREATING AND SENDING YOUR CHANGE +-------------------------------------------- + + + +1) "diff -up" +------------ + +Use "diff -up" or "diff -uprN" to create patches. + +All changes to the Linux kernel occur in the form of patches, as +generated by diff(1). When creating your patch, make sure to create it +in "unified diff" format, as supplied by the '-u' argument to diff(1). 
+Also, please use the '-p' argument which shows which C function each +change is in - that makes the resultant diff a lot easier to read. +Patches should be based in the root kernel source directory, +not in any lower subdirectory. + +To create a patch for a single file, it is often sufficient to do: + + SRCTREE=linux-2.4 + MYFILE=drivers/net/mydriver.c + + cd $SRCTREE + cp $MYFILE $MYFILE.orig + vi $MYFILE # make your change + cd .. + diff -up $SRCTREE/$MYFILE{.orig,} > /tmp/patch + +To create a patch for multiple files, you should unpack a "vanilla", +or unmodified kernel source tree, and generate a diff against your +own source tree. For example: + + MYSRC=/devel/linux-2.4 + + tar xvfz linux-2.4.0-test11.tar.gz + mv linux linux-vanilla + wget http://www.moses.uklinux.net/patches/dontdiff + diff -uprN -X dontdiff linux-vanilla $MYSRC > /tmp/patch + rm -f dontdiff + +"dontdiff" is a list of files which are generated by the kernel during +the build process, and should be ignored in any diff(1)-generated +patch. dontdiff is maintained by Tigran Aivazian + +Make sure your patch does not include any extra files which do not +belong in a patch submission. Make sure to review your patch -after- +generating it with diff(1), to ensure accuracy. + +If your changes produce a lot of deltas, you may want to look into +splitting them into individual patches which modify things in +logical stages. This will facilitate easier reviewing by other +kernel developers, very important if you want your patch accepted. +There are a number of scripts which can aid in this: + +Quilt: +http://savannah.nongnu.org/projects/quilt + +Randy Dunlap's patch scripts: +http://developer.osdl.org/rddunlap/scripts/patching-scripts.tgz + +Andrew Morton's patch scripts: +http://www.zip.com.au/~akpm/linux/patches/patch-scripts-0.16 + +2) Describe your changes. + +Describe the technical detail of the change(s) your patch includes. + +Be as specific as possible.
The WORST descriptions possible include +things like "update driver X", "bug fix for driver X", or "this patch +includes updates for subsystem X. Please apply." + +If your description starts to get long, that's a sign that you probably +need to split up your patch. See #3, next. + + + +3) Separate your changes. + +Separate each logical change into its own patch. + +For example, if your changes include both bug fixes and performance +enhancements for a single driver, separate those changes into two +or more patches. If your changes include an API update, and a new +driver which uses that new API, separate those into two patches. + +On the other hand, if you make a single change to numerous files, +group those changes into a single patch. Thus a single logical change +is contained within a single patch. + +If one patch depends on another patch in order for a change to be +complete, that is OK. Simply note "this patch depends on patch X" +in your patch description. + + +4) Select e-mail destination. + +Look through the MAINTAINERS file and the source code, and determine +if your change applies to a specific subsystem of the kernel, with +an assigned maintainer. If so, e-mail that person. + +If no maintainer is listed, or the maintainer does not respond, send +your patch to the primary Linux kernel developer's mailing list, +linux-kernel@vger.kernel.org. Most kernel developers monitor this +e-mail list, and can comment on your changes. + +Linus Torvalds is the final arbiter of all changes accepted into the +Linux kernel. His e-mail address is . He gets +a lot of e-mail, so typically you should do your best to -avoid- sending +him e-mail. + +Patches which are bug fixes, are "obvious" changes, or similarly +require little discussion should be sent or CC'd to Linus. Patches +which require discussion or do not have a clear advantage should +usually be sent first to linux-kernel. Only after the patch is +discussed should the patch then be submitted to Linus. 
+ +For small patches you may want to CC the Trivial Patch Monkey +trivial@rustcorp.com.au set up by Rusty Russell; which collects "trivial" +patches. Trivial patches must qualify for one of the following rules: + Spelling fixes in documentation + Spelling fixes which could break grep(1). + Warning fixes (cluttering with useless warnings is bad) + Compilation fixes (only if they are actually correct) + Runtime fixes (only if they actually fix things) + Removing use of deprecated functions/macros (eg. check_region). + Contact detail and documentation fixes + Non-portable code replaced by portable code (even in arch-specific, + since people copy, as long as it's trivial) + Any fix by the author/maintainer of the file. (ie. patch monkey + in re-transmission mode) + + + +5) Select your CC (e-mail carbon copy) list. + +Unless you have a reason NOT to do so, CC linux-kernel@vger.kernel.org. + +Other kernel developers besides Linus need to be aware of your change, +so that they may comment on it and offer code review and suggestions. +linux-kernel is the primary Linux kernel developer mailing list. +Other mailing lists are available for specific subsystems, such as +USB, framebuffer devices, the VFS, the SCSI subsystem, etc. See the +MAINTAINERS file for a mailing list that relates specifically to +your change. + +Even if the maintainer did not respond in step #4, make sure to ALWAYS +copy the maintainer when you change their code. + +For small patches you may want to CC the Trivial Patch Monkey +trivial@rustcorp.com.au set up by Rusty Russell; which collects "trivial" +patches. Trivial patches must qualify for one of the following rules: + Spelling fixes in documentation + Spelling fixes which could break grep(1). + Warning fixes (cluttering with useless warnings is bad) + Compilation fixes (only if they are actually correct) + Runtime fixes (only if they actually fix things) + Removing use of deprecated functions/macros (eg. check_region). 
+ Contact detail and documentation fixes + Non-portable code replaced by portable code (even in arch-specific, + since people copy, as long as it's trivial) + Any fix by the author/maintainer of the file. (ie. patch monkey + in re-transmission mode) + + + +6) No MIME, no links, no compression, no attachments. Just plain text. + +Linus and other kernel developers need to be able to read and comment +on the changes you are submitting. It is important for a kernel +developer to be able to "quote" your changes, using standard e-mail +tools, so that they may comment on specific portions of your code. + +For this reason, all patches should be submitted by e-mail "inline". +WARNING: Be wary of your editor's word-wrap corrupting your patch, +if you choose to cut-n-paste your patch. + +Do not attach the patch as a MIME attachment, compressed or not. +Many popular e-mail applications will not always transmit a MIME +attachment as plain text, making it impossible to comment on your +code. A MIME attachment also takes Linus a bit more time to process, +decreasing the likelihood of your MIME-attached change being accepted. + +Exception: If your mailer is mangling patches then someone may ask +you to re-send them using MIME. + + + +7) E-mail size. + +When sending patches to Linus, always follow step #6. + +Large changes are not appropriate for mailing lists, and some +maintainers. If your patch, uncompressed, exceeds 40 kB in size, +it is preferred that you store your patch on an Internet-accessible +server, and provide instead a URL (link) pointing to your patch. + + + +8) Name your kernel version. + +It is important to note, either in the subject line or in the patch +description, the kernel version to which this patch applies. + +If the patch does not apply cleanly to the latest kernel version, +Linus will not apply it. + + + +9) Don't get discouraged. Re-submit. + +After you have submitted your change, be patient and wait.
If Linus +likes your change and applies it, it will appear in the next version +of the kernel that he releases. + +However, if your change doesn't appear in the next version of the +kernel, there could be any number of reasons. It's YOUR job to +narrow down those reasons, correct what was wrong, and submit your +updated change. + +It is quite common for Linus to "drop" your patch without comment. +That's the nature of the system. If he drops your patch, it could be +due to +* Your patch did not apply cleanly to the latest kernel version +* Your patch was not sufficiently discussed on linux-kernel. +* A style issue (see section 2), +* An e-mail formatting issue (re-read this section) +* A technical problem with your change +* He gets tons of e-mail, and yours got lost in the shuffle +* You are being annoying (See Figure 1) + +When in doubt, solicit comments on the linux-kernel mailing list. + + + +10) Include PATCH in the subject + +Due to high e-mail traffic to Linus, and to linux-kernel, it is common +convention to prefix your subject line with [PATCH]. This lets Linus +and other kernel developers more easily distinguish patches from other +e-mail discussions. + + + +11) Sign your work + +To improve tracking of who did what, especially with patches that can +percolate to their final resting place in the kernel through several +layers of maintainers, we've introduced a "sign-off" procedure on +patches that are being emailed around. + +The sign-off is a simple line at the end of the explanation for the +patch, which certifies that you wrote it or otherwise have the right to +pass it on as an open-source patch.
The rules are pretty simple: if you +can certify the below: + + Developer's Certificate of Origin 1.0 + + By making a contribution to this project, I certify that: + + (a) The contribution was created in whole or in part by me and I + have the right to submit it under the open source license + indicated in the file; or + + (b) The contribution is based upon previous work that, to the best + of my knowledge, is covered under an appropriate open source + license and I have the right under that license to submit that + work with modifications, whether created in whole or in part + by me, under the same open source license (unless I am + permitted to submit under a different license), as indicated + in the file; or + + (c) The contribution was provided directly to me by some other + person who certified (a), (b) or (c) and I have not modified + it. + +then you just add a line saying + + Signed-off-by: Random J Developer + +Some people also put extra tags at the end. They'll just be ignored for +now, but you can do this to mark internal company procedures or just +point out some special detail about the sign-off. + + +----------------------------------- +SECTION 2 - HINTS, TIPS, AND TRICKS +----------------------------------- + +This section lists many of the common "rules" associated with code +submitted to the kernel. There are always exceptions... but you must +have a really good reason for doing so. You could probably call this +section Linus Computer Science 101. + + + +1) Read Documentation/CodingStyle + +Nuff said. If your code deviates too much from this, it is likely +to be rejected without further review, and without comment. + + + +2) #ifdefs are ugly + +Code cluttered with ifdefs is difficult to read and maintain. Don't do +it. Instead, put your ifdefs in a header, and conditionally define +'static inline' functions, or macros, which are used in the code. +Let the compiler optimize away the "no-op" case. 
+ +Simple example, of poor code: + + dev = alloc_etherdev (sizeof(struct funky_private)); + if (!dev) + return -ENODEV; + #ifdef CONFIG_NET_FUNKINESS + init_funky_net(dev); + #endif + +Cleaned-up example: + +(in header) + #ifndef CONFIG_NET_FUNKINESS + static inline void init_funky_net (struct net_device *d) {} + #endif + +(in the code itself) + dev = alloc_etherdev (sizeof(struct funky_private)); + if (!dev) + return -ENODEV; + init_funky_net(dev); + + + +3) 'static inline' is better than a macro + +Static inline functions are greatly preferred over macros. +They provide type safety, have no length limitations, no formatting +limitations, and under gcc they are as cheap as macros. + +Macros should only be used for cases where a static inline is clearly +suboptimal [there are a few, isolated cases of this in fast paths], +or where it is impossible to use a static inline function [such as +string-izing]. + +'static inline' is preferred over 'static __inline__', 'extern inline', +and 'extern __inline__'. + + + +4) Don't over-design. + +Don't try to anticipate nebulous future cases which may or may not +be useful: "Make it as simple as you can, and no simpler" + + + diff --git a/Documentation/VGA-softcursor.txt b/Documentation/VGA-softcursor.txt new file mode 100644 index 000000000000..70acfbf399eb --- /dev/null +++ b/Documentation/VGA-softcursor.txt @@ -0,0 +1,39 @@ +Software cursor for VGA by Pavel Machek +======================= and Martin Mares + + Linux now has some ability to manipulate cursor appearance. Normally, you +can set the size of the hardware cursor (and also work around some ugly bugs in +those miserable Trident cards--see #define TRIDENT_GLITCH in drivers/video/ +vgacon.c). You can now play a few new tricks: you can make your cursor look +like a non-blinking red block, make it invert the background of the character it's +over or highlight that character, and still choose whether the original +hardware cursor should remain visible or not.
There may be other things I have +never thought of. + + The cursor appearance is controlled by a "[?1;2;3c" escape sequence +where 1, 2 and 3 are parameters described below. If you omit any of them, +they will default to zeroes. + + Parameter 1 specifies cursor size (0=default, 1=invisible, 2=underline, ..., +8=full block) + 16 if you want the software cursor to be applied + 32 if you +want to always change the background color + 64 if you dislike having the +background the same as the foreground. Highlights are ignored for the last two +flags. + + The second parameter selects character attribute bits you want to change +(by simply XORing them with the value of this parameter). On standard VGA, +the high four bits specify background and the low four the foreground. In both +groups, low three bits set color (as in normal color codes used by the console) +and the most significant one turns on highlight (or sometimes blinking--it +depends on the configuration of your VGA). + + The third parameter consists of character attribute bits you want to set. +Bit setting takes place before bit toggling, so you can simply clear a bit by +including it in both the set mask and the toggle mask. + +Examples: +========= + +To get normal blinking underline, use: echo -e '\033[?2c' +To get blinking block, use: echo -e '\033[?6c' +To get red non-blinking block, use: echo -e '\033[?17;0;64c' diff --git a/Documentation/aoe/aoe.txt b/Documentation/aoe/aoe.txt new file mode 100644 index 000000000000..43e50108d0e2 --- /dev/null +++ b/Documentation/aoe/aoe.txt @@ -0,0 +1,91 @@ +The EtherDrive (R) HOWTO for users of 2.6 kernels is found at ... + + http://www.coraid.com/support/linux/EtherDrive-2.6-HOWTO.html + + It has many tips and hints! + +CREATING DEVICE NODES + + Users of udev should find the block device nodes created + automatically, but to create all the necessary device nodes, use the + udev configuration rules provided in udev.txt (in this directory). 
+ + There is a udev-install.sh script that shows how to install these + rules on your system. + + If you are not using udev, two scripts are provided in + Documentation/aoe as examples of static device node creation for + using the aoe driver. + + rm -rf /dev/etherd + sh Documentation/aoe/mkdevs.sh /dev/etherd + + ... or to make just one shelf's worth of block device nodes ... + + sh Documentation/aoe/mkshelf.sh /dev/etherd 0 + + There is also an autoload script that shows how to edit + /etc/modprobe.conf to ensure that the aoe module is loaded when + necessary. + +USING DEVICE NODES + + "cat /dev/etherd/err" blocks, waiting for error diagnostic output, + like any retransmitted packets. + + "echo eth2 eth4 > /dev/etherd/interfaces" tells the aoe driver to + limit ATA over Ethernet traffic to eth2 and eth4. AoE traffic from + untrusted networks should be ignored as a matter of security. + + "echo > /dev/etherd/discover" tells the driver to find out what AoE + devices are available. + + These character devices may disappear and be replaced by sysfs + counterparts, so distribution maintainers are encouraged to create + scripts that use these devices. + + The block devices are named like this: + + e{shelf}.{slot} + e{shelf}.{slot}p{part} + + ... so that "e0.2" is the third blade from the left (slot 2) in the + first shelf (shelf address zero). That's the whole disk. The first + partition on that disk would be "e0.2p1". + +USING SYSFS + + Each aoe block device in /sys/block has the extra attributes of + state, mac, and netif. The state attribute is "up" when the device + is ready for I/O and "down" if detected but unusable. The + "down,closewait" state shows that the device is still open and + cannot come up again until it has been closed. + + The mac attribute is the ethernet address of the remote AoE device. + The netif attribute is the network interface on the localhost + through which we are communicating with the remote AoE device. 
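To inspect a single device rather than the whole set, the three attributes can be read directly. The sketch below is an illustration only: the function name aoe_info and its optional sysfs-root argument are made up, and the directory name form "etherd!e0.2" under /sys/block is an assumption inferred from how status.sh strips everything up to the "!" from those entries.

```shell
# aoe_info DEV [SYSFS_DIR]
# Hypothetical helper: print state, mac and netif for one AoE block
# device, e.g. "aoe_info e0.2". SYSFS_DIR defaults to /sys.
aoe_info() {
	dev=$1
	sysd=${2:-/sys}
	# directory name is assumed to be "etherd!<dev>", as status.sh implies
	d=$sysd/block/'etherd!'$dev
	test -d "$d" || {
		echo "aoe_info: no such AoE device: $dev" 1>&2
		return 1
	}
	for attr in state mac netif; do
		printf '%s=%s\n' "$attr" "$(cat "$d/$attr")"
	done
}
```

A state other than "up" means the device is not currently ready for I/O.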
+ + There is a script in this directory that formats this information + in a convenient way. + + root@makki root# sh Documentation/aoe/status.sh + e10.0 eth3 up + e10.1 eth3 up + e10.2 eth3 up + e10.3 eth3 up + e10.4 eth3 up + e10.5 eth3 up + e10.6 eth3 up + e10.7 eth3 up + e10.8 eth3 up + e10.9 eth3 up + e4.0 eth1 up + e4.1 eth1 up + e4.2 eth1 up + e4.3 eth1 up + e4.4 eth1 up + e4.5 eth1 up + e4.6 eth1 up + e4.7 eth1 up + e4.8 eth1 up + e4.9 eth1 up diff --git a/Documentation/aoe/autoload.sh b/Documentation/aoe/autoload.sh new file mode 100644 index 000000000000..78dad1334c6f --- /dev/null +++ b/Documentation/aoe/autoload.sh @@ -0,0 +1,17 @@ +#!/bin/sh +# set aoe to autoload by installing the +# aliases in /etc/modprobe.conf + +f=/etc/modprobe.conf + +if test ! -r $f || test ! -w $f; then + echo "cannot configure $f for module autoloading" 1>&2 + exit 1 +fi + +grep major-152 $f >/dev/null +if [ $? = 1 ]; then + echo alias block-major-152 aoe >> $f + echo alias char-major-152 aoe >> $f +fi + diff --git a/Documentation/aoe/mkdevs.sh b/Documentation/aoe/mkdevs.sh new file mode 100644 index 000000000000..6ce70703eb47 --- /dev/null +++ b/Documentation/aoe/mkdevs.sh @@ -0,0 +1,36 @@ +#!/bin/sh + +n_shelves=${n_shelves:-10} +n_partitions=${n_partitions:-16} + +if test "$#" != "1"; then + echo "Usage: sh `basename $0` {dir}" 1>&2 + exit 1 +fi +dir=$1 + +MAJOR=152 + +echo "Creating AoE devnode files in $dir ..." + +set -e + +mkdir -p $dir + +# (Status info is in sysfs. See status.sh.) 
+# rm -f $dir/stat +# mknod -m 0400 $dir/stat c $MAJOR 1 +rm -f $dir/err +mknod -m 0400 $dir/err c $MAJOR 2 +rm -f $dir/discover +mknod -m 0200 $dir/discover c $MAJOR 3 +rm -f $dir/interfaces +mknod -m 0200 $dir/interfaces c $MAJOR 4 + +export n_partitions +mkshelf=`echo $0 | sed 's!mkdevs!mkshelf!'` +i=0 +while test $i -lt $n_shelves; do + sh -xc "sh $mkshelf $dir $i" + i=`expr $i + 1` +done diff --git a/Documentation/aoe/mkshelf.sh b/Documentation/aoe/mkshelf.sh new file mode 100644 index 000000000000..40932836bb80 --- /dev/null +++ b/Documentation/aoe/mkshelf.sh @@ -0,0 +1,25 @@ +#! /bin/sh + +if test "$#" != "2"; then + echo "Usage: sh `basename $0` {dir} {shelfaddress}" 1>&2 + exit 1 +fi +n_partitions=${n_partitions:-16} +dir=$1 +shelf=$2 +MAJOR=152 + +set -e + +minor=`echo 10 \* $shelf \* $n_partitions | bc` +endp=`echo $n_partitions - 1 | bc` +for slot in `seq 0 9`; do + for part in `seq 0 $endp`; do + name=e$shelf.$slot + test "$part" != "0" && name=${name}p$part + rm -f $dir/$name + mknod -m 0660 $dir/$name b $MAJOR $minor + + minor=`expr $minor + 1` + done +done diff --git a/Documentation/aoe/status.sh b/Documentation/aoe/status.sh new file mode 100644 index 000000000000..6628116d4a9f --- /dev/null +++ b/Documentation/aoe/status.sh @@ -0,0 +1,31 @@ +#! /bin/sh +# collate and present sysfs information about AoE storage + +set -e +format="%8s\t%8s\t%8s\n" +me=`basename $0` +sysd=${sysfs_dir:-/sys} + +# printf "$format" device mac netif state + +# Suse 9.1 Pro doesn't put /sys in /etc/mtab +#test -z "`mount | grep sysfs`" && { +test ! 
-d "$sysd/block" && { + echo "$me Error: sysfs is not mounted" 1>&2 + exit 1 +} +test -z "`lsmod | grep '^aoe'`" && { + echo "$me Error: aoe module is not loaded" 1>&2 + exit 1 +} + +for d in `ls -d $sysd/block/etherd* 2>/dev/null | grep -v p` end; do + # maybe ls comes up empty, so we use "end" + test $d = end && continue + + dev=`echo "$d" | sed 's/.*!//'` + printf "$format" \ + "$dev" \ + "`cat \"$d/netif\"`" \ + "`cat \"$d/state\"`" +done | sort diff --git a/Documentation/aoe/udev-install.sh b/Documentation/aoe/udev-install.sh new file mode 100644 index 000000000000..861a27f98771 --- /dev/null +++ b/Documentation/aoe/udev-install.sh @@ -0,0 +1,26 @@ +# install the aoe-specific udev rules from udev.txt into +# the system's udev configuration +# + +me="`basename $0`" + +# find udev.conf, often /etc/udev/udev.conf +# (or environment can specify where to find udev.conf) +# +if test -z "$conf"; then + if test -r /etc/udev/udev.conf; then + conf=/etc/udev/udev.conf + else + conf="`find /etc -type f -name udev.conf 2> /dev/null`" + if test -z "$conf" || test ! -r "$conf"; then + echo "$me Error: no udev.conf found" 1>&2 + exit 1 + fi + fi +fi + +# find the directory where udev rules are stored, often +# /etc/udev/rules.d +# +rules_d="`sed -n '/^udev_rules=/{ s!udev_rules=!!; s!\"!!g; p; }' $conf`" +test "$rules_d" && sh -xc "cp `dirname $0`/udev.txt $rules_d/60-aoe.rules" diff --git a/Documentation/aoe/udev.txt b/Documentation/aoe/udev.txt new file mode 100644 index 000000000000..ab39d8bb634c --- /dev/null +++ b/Documentation/aoe/udev.txt @@ -0,0 +1,23 @@ +# These rules tell udev what device nodes to create for aoe support. +# They may be installed along the following lines (adjusted to what +# you see on your system). 
+# +# ecashin@makki ~$ su +# Password: +# bash# find /etc -type f -name udev.conf +# /etc/udev/udev.conf +# bash# grep udev_rules= /etc/udev/udev.conf +# udev_rules="/etc/udev/rules.d/" +# bash# ls /etc/udev/rules.d/ +# 10-wacom.rules 50-udev.rules +# bash# cp /path/to/linux-2.6.xx/Documentation/aoe/udev.txt \ +# /etc/udev/rules.d/60-aoe.rules +# + +# aoe char devices +SUBSYSTEM="aoe", KERNEL="discover", NAME="etherd/%k", GROUP="disk", MODE="0220" +SUBSYSTEM="aoe", KERNEL="err", NAME="etherd/%k", GROUP="disk", MODE="0440" +SUBSYSTEM="aoe", KERNEL="interfaces", NAME="etherd/%k", GROUP="disk", MODE="0220" + +# aoe block devices +KERNEL="etherd*", NAME="%k", GROUP="disk" diff --git a/Documentation/arm/00-INDEX b/Documentation/arm/00-INDEX new file mode 100644 index 000000000000..d753fe59a248 --- /dev/null +++ b/Documentation/arm/00-INDEX @@ -0,0 +1,20 @@ +00-INDEX + - this file +Booting + - requirements for booting +Interrupts + - ARM Interrupt subsystem documentation +Netwinder + - Netwinder specific documentation +README + - General ARM documentation +SA1100 + - SA1100 documentation +XScale + - XScale documentation +empeg + - Empeg documentation +mem_alignment + - alignment abort handler documentation +nwfpe + - NWFPE floating point emulator documentation diff --git a/Documentation/arm/Booting b/Documentation/arm/Booting new file mode 100644 index 000000000000..fad566bb02fc --- /dev/null +++ b/Documentation/arm/Booting @@ -0,0 +1,141 @@ + Booting ARM Linux + ================= + +Author: Russell King +Date : 18 May 2002 + +The following documentation is relevant to 2.4.18-rmk6 and beyond. + +In order to boot ARM Linux, you require a boot loader, which is a small +program that runs before the main kernel. The boot loader is expected +to initialise various devices, and eventually call the Linux kernel, +passing information to the kernel. + +Essentially, the boot loader should provide (as a minimum) the +following: + +1. Setup and initialise the RAM. +2. 
Initialise one serial port. +3. Detect the machine type. +4. Setup the kernel tagged list. +5. Call the kernel image. + + +1. Setup and initialise RAM +--------------------------- + +Existing boot loaders: MANDATORY +New boot loaders: MANDATORY + +The boot loader is expected to find and initialise all RAM that the +kernel will use for volatile data storage in the system. It performs +this in a machine dependent manner. (It may use internal algorithms +to automatically locate and size all RAM, or it may use knowledge of +the RAM in the machine, or any other method the boot loader designer +sees fit.) + + +2. Initialise one serial port +----------------------------- + +Existing boot loaders: OPTIONAL, RECOMMENDED +New boot loaders: OPTIONAL, RECOMMENDED + +The boot loader should initialise and enable one serial port on the +target. This allows the kernel serial driver to automatically detect +which serial port it should use for the kernel console (generally +used for debugging purposes, or communication with the target.) + +As an alternative, the boot loader can pass the relevant 'console=' +option to the kernel via the tagged lists specifying the port, and +serial format options as described in + + Documentation/kernel-parameters.txt. + + +3. Detect the machine type +-------------------------- + +Existing boot loaders: OPTIONAL +New boot loaders: MANDATORY + +The boot loader should detect the machine type it's running on by some +method. Whether this is a hard coded value or some algorithm that +looks at the connected hardware is beyond the scope of this document. +The boot loader must ultimately be able to provide a MACH_TYPE_xxx +value to the kernel (see linux/arch/arm/tools/mach-types). + + +4. Setup the kernel tagged list +------------------------------- + +Existing boot loaders: OPTIONAL, HIGHLY RECOMMENDED +New boot loaders: MANDATORY + +The boot loader must create and initialise the kernel tagged list.
+A valid tagged list starts with ATAG_CORE and ends with ATAG_NONE. +The ATAG_CORE tag may or may not be empty. An empty ATAG_CORE tag +has the size field set to '2' (0x00000002). The ATAG_NONE tag must set +the size field to zero. + +Any number of tags can be placed in the list. It is undefined +whether a repeated tag appends to the information carried by the +previous tag, or whether it replaces the information in its +entirety; some tags behave as the former, others the latter. + +The boot loader must pass at a minimum the size and location of +the system memory, and root filesystem location. Therefore, the +minimum tagged list should look like: + + +-----------+ +base -> | ATAG_CORE | | + +-----------+ | + | ATAG_MEM | | increasing address + +-----------+ | + | ATAG_NONE | | + +-----------+ v + +The tagged list should be stored in system RAM. + +The tagged list must be placed in a region of memory where neither +the kernel decompressor nor initrd 'bootp' program will overwrite +it. The recommended placement is in the first 16KiB of RAM. + +5. Calling the kernel image +--------------------------- + +Existing boot loaders: MANDATORY +New boot loaders: MANDATORY + +There are two options for calling the kernel zImage. If the zImage +is stored in flash, and is linked correctly to be run from flash, +then it is legal for the boot loader to call the zImage in flash +directly. + +The zImage may also be placed in system RAM (at any location) and +called there. Note that the kernel uses 16K of RAM below the image +to store page tables. The recommended placement is 32KiB into RAM. + +In either case, the following conditions must be met: + +- Quiesce all DMA capable devices so that memory does not get + corrupted by bogus network packets or disk data. This will save + you many hours of debugging. + +- CPU register settings + r0 = 0, + r1 = machine type number discovered in (3) above. + r2 = physical address of tagged list in system RAM. 
+ +- CPU mode + All forms of interrupts must be disabled (IRQs and FIQs) + The CPU must be in SVC mode. (A special exception exists for Angel) + +- Caches, MMUs + The MMU must be off. + Instruction cache may be on or off. + Data cache must be off. + +- The boot loader is expected to call the kernel image by jumping + directly to the first instruction of the kernel image. + diff --git a/Documentation/arm/IXP2000 b/Documentation/arm/IXP2000 new file mode 100644 index 000000000000..e0148b6b2c40 --- /dev/null +++ b/Documentation/arm/IXP2000 @@ -0,0 +1,69 @@ + +------------------------------------------------------------------------- +Release Notes for Linux on Intel's IXP2000 Network Processor + +Maintained by Deepak Saxena +------------------------------------------------------------------------- + +1. Overview + +Intel's IXP2000 family of NPUs (IXP2400, IXP2800, IXP2850) is designed +for high-performance network applications such as high-availability +telecom systems. In addition to an XScale core, it contains up to 8 +"MicroEngines" that run special code, several high-end networking +interfaces (UTOPIA, SPI, etc), a PCI host bridge, one serial port, +flash interface, and some other odds and ends. For more information, see: + +http://developer.intel.com/design/network/products/npfamily/ixp2xxx.htm + +2. Linux Support + +Linux currently supports the following features on the IXP2000 NPUs: + +- On-chip serial +- PCI +- Flash (MTD/JFFS2) +- I2C through GPIO +- Timers (watchdog, OS) + +That is about all we can support under Linux at the moment because the core +networking components of the chip are accessed via Intel's closed source SDK. +Please contact Intel directly on issues with using those. There is +also a mailing list run by some folks at Princeton University that might +be of help: https://lists.cs.princeton.edu/mailman/listinfo/ixp2xxx + +WHATEVER YOU DO, DO NOT POST EMAIL TO THE LINUX-ARM OR LINUX-ARM-KERNEL +MAILING LISTS REGARDING THE INTEL SDK. + +3. 
Supported Platforms + +- Intel IXDP2400 Reference Platform +- Intel IXDP2800 Reference Platform +- Intel IXDP2401 Reference Platform +- Intel IXDP2801 Reference Platform +- RadiSys ENP-2611 + +4. Usage Notes + +- The IXP2000 platforms usually have rather complex PCI bus topologies + with large memory space requirements. In addition, because of the way the + Intel SDK is designed, devices are enumerated in a very specific + way. Because of this, we use the "pci=firmware" option in the kernel + command line so that we do not re-enumerate the bus. + +- IXDP2x01 systems have variable clock tick rates that we cannot determine + via HW registers. The "ixdp2x01_clk=XXX" cmd line options allow you + to pass the clock rate to the board port. + +5. Thanks + +The IXP2000 work has been funded by Intel Corp. and MontaVista Software, Inc. + +The following people have contributed patches/comments/etc: + +Naeem F. Afzal +Lennert Buytenhek +Jeffrey Daly + +------------------------------------------------------------------------- +Last Update: 8/09/2004 diff --git a/Documentation/arm/IXP4xx b/Documentation/arm/IXP4xx new file mode 100644 index 000000000000..d4c6d3aa0c25 --- /dev/null +++ b/Documentation/arm/IXP4xx @@ -0,0 +1,174 @@ + +------------------------------------------------------------------------- +Release Notes for Linux on Intel's IXP4xx Network Processor + +Maintained by Deepak Saxena +------------------------------------------------------------------------- + +1. Overview + +Intel's IXP4xx network processor is a highly integrated SOC that +is targeted for network applications, though it has become popular +in industrial control and other areas due to low cost and power +consumption. The IXP4xx family currently consists of several processors +that support different network offload functions such as encryption, +routing, firewalling, etc. 
The IXP46x family is an updated version which +supports faster speeds, new memory and flash configurations, and more +integration such as an on-chip I2C controller. + +For more information on the various versions of the CPU, see: + + http://developer.intel.com/design/network/products/npfamily/ixp4xx.htm + +Intel also made the IXCP1100 CPU for some time, which is an IXP4xx +stripped of much of the network intelligence. + +2. Linux Support + +Linux currently supports the following features on the IXP4xx chips: + +- Dual serial ports +- PCI interface +- Flash access (MTD/JFFS) +- I2C through GPIO on IXP42x +- GPIO for input/output/interrupts + See include/asm-arm/arch-ixp4xx/platform.h for access functions. +- Timers (watchdog, OS) + +The following components of the chips are not supported by Linux and +require the use of Intel's proprietary CSR software: + +- USB device interface +- Network interfaces (HSS, Utopia, NPEs, etc) +- Network offload functionality + +If you need to use any of the above, you need to download Intel's +software from: + + http://developer.intel.com/design/network/products/npfamily/ixp425swr1.htm + +DO NOT POST QUESTIONS TO THE LINUX MAILING LISTS REGARDING THE PROPRIETARY +SOFTWARE. + +There are several websites that provide directions/pointers on using +Intel's software: + +http://ixp4xx-osdg.sourceforge.net/ + Open Source Developer's Guide for using uClinux and the Intel libraries + +http://gatewaymaker.sourceforge.net/ + Simple one page summary of building a gateway using an IXP425 and Linux + +http://ixp425.sourceforge.net/ + ATM device driver for IXP425 that relies on Intel's libraries + +3. Known Issues/Limitations + +3a. Limited inbound PCI window + +The IXP4xx family allows for up to 256MB of memory but the PCI interface +can only expose 64MB of that memory to the PCI bus. This means that if +you are running with > 64MB, all PCI buffers outside of the accessible +range will be bounced using the routines in arch/arm/common/dmabounce.c. 
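The inbound window limit described in 3a can be illustrated with a small, user-space sketch. It is not the code in arch/arm/common/dmabounce.c; the function name and the assumption that the PCI-visible window covers the first 64MB of system RAM are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption for illustration only: the PCI-visible window covers the
 * first 64MB of system RAM. The text above gives the window size but
 * not its placement.
 */
#define DEMO_PCI_INBOUND_WINDOW (64u * 1024u * 1024u)

/*
 * Hypothetical helper: returns true if a buffer starting ram_offset
 * bytes into system RAM and len bytes long is not fully inside the
 * window, and so would have to be bounced.
 */
bool demo_buffer_needs_bounce(uint32_t ram_offset, uint32_t len)
{
	if (len == 0)
		return false;
	if (ram_offset >= DEMO_PCI_INBOUND_WINDOW)
		return true;
	/* Written this way so that ram_offset + len cannot overflow. */
	return len > DEMO_PCI_INBOUND_WINDOW - ram_offset;
}
```

Under that assumption, a buffer that ends exactly at the 64MB mark is reachable from PCI directly, while one that crosses it by a single byte is not.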
+ +3b. Limited outbound PCI window + +IXP4xx provides two methods of accessing PCI memory space: + +1) A direct mapped window from 0x48000000 to 0x4bffffff (64MB). + To access PCI via this space, we simply ioremap() the BAR + into the kernel and we can use the standard read[bwl]/write[bwl] + macros. This is the preferred method due to speed but it + limits the system to just 64MB of PCI memory. This can be + problematic if using video cards and other memory-heavy devices. + +2) If > 64MB of memory space is required, the IXP4xx can be + configured to use indirect registers to access PCI. This allows + for up to 128MB (0x48000000 to 0x4fffffff) of memory on the bus. + The disadvantage of this is that every PCI access requires + three local register accesses plus a spinlock, but in some + cases the performance hit is acceptable. In addition, you cannot + mmap() PCI devices in this case due to the indirect nature + of the PCI window. + +By default, the direct method is used for performance reasons. If +you need more PCI memory, enable the IXP4XX_INDIRECT_PCI config option. + +3c. GPIO as Interrupts + +Currently the code only handles level-sensitive GPIO interrupts. + +4. Supported platforms + +ADI Engineering Coyote Gateway Reference Platform +http://www.adiengineering.com/productsCoyote.html + + The ADI Coyote platform is a reference design for those building + small residential/office gateways. One NPE is connected to a 10/100 + interface, one to a 4-port 10/100 switch, and the third to an ADSL + interface. In addition, it also supports two POTS interfaces connected + via SLICs. Note that those are not supported by Linux at the moment. + The platform also has two mini-PCI slots used for 802.11[bga] cards. + Finally, there is an IDE port hanging off the expansion bus. 
+ +Gateworks Avila Network Platform +http://www.gateworks.com/avila_sbc.htm + + The Avila platform is basically an IXDP425 with the 4 PCI slots + replaced with mini-PCI slots and a CF IDE interface hanging off + the expansion bus. + +Intel IXDP425 Development Platform +http://developer.intel.com/design/network/products/npfamily/ixdp425.htm + + This is Intel's standard reference platform for the IXDP425 and is + also known as the Richfield board. It contains 4 PCI slots, 16MB + of flash, two 10/100 ports and one ADSL port. + +Intel IXDP465 Development Platform +http://developer.intel.com/design/network/products/npfamily/ixdp465.htm + + This is basically an IXDP425 with an IXP465 and 32MB of flash instead + of just 16. + +Intel IXDPG425 Development Platform + + This is basically an ADI Coyote board with a NEC EHCI controller + added. One issue with this board is that the mini-PCI slots only + have the 3.3v line connected, so you can't use a PCI to mini-PCI + adapter with an E100 card. So to NFS root you need to use either + the CSR or a WiFi card and a ramdisk that BOOTPs and then does + a pivot_root to NFS. + +Motorola PrPMC1100 Processor Mezzanine Card +http://www.fountainsys.com/datasheet/PrPMC1100.pdf + + The PrPMC1100 is based on the IXCP1100 and is meant to plug into + an IXP2400/2800 system to act as the system controller. It simply + contains a CPU and 16MB of flash on the board and needs to be + plugged into a carrier board to function. Currently Linux only + supports the Motorola PrPMC carrier board for this platform. + See https://mcg.motorola.com/us/ds/pdf/ds0144.pdf for info + on the carrier board. + +5. TODO LIST + +- Add support for Coyote IDE +- Add support for edge-based GPIO interrupts +- Add support for CF IDE on expansion bus + +6. Thanks + +The IXP4xx work has been funded by Intel Corp. and MontaVista Software, Inc. + +The following people have contributed patches/comments/etc: + +Lennert Buytenhek +Lutz Jaenicke +Justin Mayfield +Robert E. 
Ranslam +[I know I've forgotten others, please email me to be added] + +------------------------------------------------------------------------- + +Last Update: 01/04/2005 diff --git a/Documentation/arm/Interrupts b/Documentation/arm/Interrupts new file mode 100644 index 000000000000..72c93de8cd4e --- /dev/null +++ b/Documentation/arm/Interrupts @@ -0,0 +1,173 @@ +2.5.2-rmk5 +---------- + +This is the first kernel that contains a major shake up of some of the +major architecture-specific subsystems. + +Firstly, it contains some pretty major changes to the way we handle the +MMU TLB. Each MMU TLB variant is now handled completely separately - +we have TLB v3, TLB v4 (without write buffer), TLB v4 (with write buffer), +and finally TLB v4 (with write buffer, with I TLB invalidate entry). +There is more assembly code inside each of these functions, mainly to +allow more flexible TLB handling for the future. + +Secondly, the IRQ subsystem. + +The 2.5 kernels will be having major changes to the way IRQs are handled. +Unfortunately, this means that machine types that touch the irq_desc[] +array (basically all machine types) will break, and this means every +machine type that we currently have. + +Let's take an example. On the Assabet with Neponset, we have: + + GPIO25 IRR:2 + SA1100 ------------> Neponset -----------> SA1111 + IIR:1 + -----------> USAR + IIR:0 + -----------> SMC9196 + +The way stuff currently works, all SA1111 interrupts are mutually +exclusive of each other - if you're processing one interrupt from the +SA1111 and another comes in, you have to wait for that interrupt to +finish processing before you can service the new interrupt. Eg, an +IDE PIO-based interrupt on the SA1111 excludes all other SA1111 and +SMC9196 interrupts until it has finished transferring its multi-sector +data, which can be a long time. Note also that since we loop in the +SA1111 IRQ handler, SA1111 IRQs can hold off SMC9196 IRQs indefinitely. 
+ + +The new approach brings several new ideas... + +We introduce the concept of a "parent" and a "child". For example, +to the Neponset handler, the "parent" is GPIO25, and the "children" +are SA1111, SMC9196 and USAR. + +We also bring the idea of an IRQ "chip" (mainly to reduce the size of +the irqdesc array). This doesn't have to be a real "IC"; indeed the +SA11x0 IRQs are handled by two separate "chip" structures, one for +GPIO0-10, and another for all the rest. It is just a container for +the various operations (maybe this'll change to a better name). +This structure has the following operations: + +struct irqchip { + /* + * Acknowledge the IRQ. + * If this is a level-based IRQ, then it is expected to mask the IRQ + * as well. + */ + void (*ack)(unsigned int irq); + /* + * Mask the IRQ in hardware. + */ + void (*mask)(unsigned int irq); + /* + * Unmask the IRQ in hardware. + */ + void (*unmask)(unsigned int irq); + /* + * Re-run the IRQ + */ + void (*rerun)(unsigned int irq); + /* + * Set the type of the IRQ. + */ + int (*type)(unsigned int irq, unsigned int type); +}; + +ack - required. May be the same function as mask for IRQs + handled by do_level_IRQ. +mask - required. +unmask - required. +rerun - optional. Not required if you're using do_level_IRQ for all + IRQs that use this 'irqchip'. Generally expected to re-trigger + the hardware IRQ if possible. If not, may call the handler + directly. +type - optional. If you don't support changing the type of an IRQ, + it should be null so people can detect if they are unable to + set the IRQ type. 
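As a rough illustration of how the operations above fit together, here is a user-space sketch of an 'irqchip' for an imaginary level-triggered controller. The structure is copied from the description above, with the last parameter of 'type' written as 'unsigned int type'. Everything prefixed demo_ is hypothetical: the two "registers" are plain variables standing in for hardware, and IRQ numbers are assumed to be below 32.

```c
#include <stddef.h>
#include <stdint.h>

/* Same operations as described in the text above. */
struct irqchip {
	void (*ack)(unsigned int irq);
	void (*mask)(unsigned int irq);
	void (*unmask)(unsigned int irq);
	void (*rerun)(unsigned int irq);
	int (*type)(unsigned int irq, unsigned int type);
};

/* Simulated registers of a hypothetical controller (irq must be < 32). */
uint32_t demo_mask_reg;		/* bit set = IRQ masked */
uint32_t demo_pending_reg;	/* bit set = IRQ pending */

void demo_mask(unsigned int irq)
{
	demo_mask_reg |= 1u << irq;
}

void demo_unmask(unsigned int irq)
{
	demo_mask_reg &= ~(1u << irq);
}

/* Level-based IRQ: acknowledging is expected to mask the IRQ as well. */
void demo_ack(unsigned int irq)
{
	demo_pending_reg &= ~(1u << irq);
	demo_mask(irq);
}

struct irqchip demo_chip = {
	.ack	= demo_ack,
	.mask	= demo_mask,
	.unmask	= demo_unmask,
	.rerun	= NULL,	/* optional: only level IRQs in this sketch */
	.type	= NULL,	/* optional: null signals the type cannot be changed */
};
```

Leaving 'rerun' and 'type' null follows the notes above: the first is only optional when every IRQ on the chip is level-based, and the second lets callers detect that the IRQ type cannot be set.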
+ +For each IRQ, we keep the following information: + + - "disable" depth (number of disable_irq()s without enable_irq()s) + - flags indicating what we can do with this IRQ (valid, probe, + noautounmask) as before + - status of the IRQ (probing, enable, etc) + - chip + - per-IRQ handler + - irqaction structure list + +The handler can be one of the 3 standard handlers - "level", "edge" and +"simple", or your own specific handler if you need to do something special. + +The "level" handler is what we currently have - it's pretty simple. +"edge" knows about the brokenness of such IRQ implementations - that you +need to leave the hardware IRQ enabled while processing it, and queue +further IRQ events should the IRQ happen again while processing. The +"simple" handler is very basic, and does not perform any hardware +manipulation, nor state tracking. This is useful for things like the +SMC9196 and USAR above. + +So, what's changed? + +1. Machine implementations must not write to the irqdesc array. + +2. New functions to manipulate the irqdesc array. The first 4 are expected + to be useful only to machine specific code. The last is recommended to + only be used by machine specific code, but may be used in drivers if + absolutely necessary. + + set_irq_chip(irq,chip) + + Set the mask/unmask methods for handling this IRQ + + set_irq_handler(irq,handler) + + Set the handler for this IRQ (level, edge, simple) + + set_irq_chained_handler(irq,handler) + + Set a "chained" handler for this IRQ - automatically + enables this IRQ (eg, Neponset and SA1111 handlers). + + set_irq_flags(irq,flags) + + Set the valid/probe/noautoenable flags. + + set_irq_type(irq,type) + + Set the active IRQ edge(s)/level. This replaces the + SA1111 INTPOL manipulation, and the set_GPIO_IRQ_edge() + function. 
Type should be one of the following: + + #define IRQT_NOEDGE (0) + #define IRQT_RISING (__IRQT_RISEDGE) + #define IRQT_FALLING (__IRQT_FALEDGE) + #define IRQT_BOTHEDGE (__IRQT_RISEDGE|__IRQT_FALEDGE) + #define IRQT_LOW (__IRQT_LOWLVL) + #define IRQT_HIGH (__IRQT_HIGHLVL) + +3. set_GPIO_IRQ_edge() is obsolete, and should be replaced by set_irq_type. + +4. Direct access to SA1111 INTPOL is deprecated. Use set_irq_type instead. + +5. A handler is expected to perform any necessary acknowledgement of the + parent IRQ via the correct chip specific function. For instance, if + the SA1111 is directly connected to a SA1110 GPIO, then you should + acknowledge the SA1110 IRQ each time you re-read the SA1111 IRQ status. + +6. For any child which doesn't have its own IRQ enable/disable controls + (eg, SMC9196), the handler must mask or acknowledge the parent IRQ + while the child handler is called, and the child handler should be the + "simple" handler (not "edge" nor "level"). After the handler completes, + the parent IRQ should be unmasked, and the status of all children must + be re-checked for pending events. (see the Neponset IRQ handler for + details). + +7. fixup_irq() is gone, as is include/asm-arm/arch-*/irq.h + +Please note that this will not solve all problems - some of them are +hardware based. Mixing level-based and edge-based IRQs on the same +parent signal (eg neponset) is one such area where a software based +solution can't provide the full answer to low IRQ latency. + diff --git a/Documentation/arm/Netwinder b/Documentation/arm/Netwinder new file mode 100644 index 000000000000..f1b457fbd3de --- /dev/null +++ b/Documentation/arm/Netwinder @@ -0,0 +1,78 @@ +NetWinder specific documentation +================================ + +The NetWinder is a small low-power computer, primarily designed +to run Linux. It is based around the StrongARM RISC processor, +DC21285 PCI bridge, with PC-type hardware glued around it. 
+ +Port usage +========== + +Min - Max Description +--------------------------- +0x0000 - 0x000f DMA1 +0x0020 - 0x0021 PIC1 +0x0060 - 0x006f Keyboard +0x0070 - 0x007f RTC +0x0080 - 0x0087 DMA1 +0x0088 - 0x008f DMA2 +0x00a0 - 0x00a3 PIC2 +0x00c0 - 0x00df DMA2 +0x0180 - 0x0187 IRDA +0x01f0 - 0x01f6 ide0 +0x0201 Game port +0x0203 RWA010 configuration read +0x0220 - ? SoundBlaster +0x0250 - ? WaveArtist +0x0279 RWA010 configuration index +0x02f8 - 0x02ff Serial ttyS1 +0x0300 - 0x031f Ether10 +0x0338 GPIO1 +0x033a GPIO2 +0x0370 - 0x0371 W83977F configuration registers +0x0388 - ? AdLib +0x03c0 - 0x03df VGA +0x03f6 ide0 +0x03f8 - 0x03ff Serial ttyS0 +0x0400 - 0x0408 DC21143 +0x0480 - 0x0487 DMA1 +0x0488 - 0x048f DMA2 +0x0a79 RWA010 configuration write +0xe800 - 0xe80f ide0/ide1 BM DMA + + +Interrupt usage +=============== + +IRQ type Description +--------------------------- + 0 ISA 100Hz timer + 1 ISA Keyboard + 2 ISA cascade + 3 ISA Serial ttyS1 + 4 ISA Serial ttyS0 + 5 ISA PS/2 mouse + 6 ISA IRDA + 7 ISA Printer + 8 ISA RTC alarm + 9 ISA +10 ISA GP10 (Orange reset button) +11 ISA +12 ISA WaveArtist +13 ISA +14 ISA hda1 +15 ISA + +DMA usage +========= + +DMA type Description +--------------------------- + 0 ISA IRDA + 1 ISA + 2 ISA cascade + 3 ISA WaveArtist + 4 ISA + 5 ISA + 6 ISA + 7 ISA WaveArtist diff --git a/Documentation/arm/Porting b/Documentation/arm/Porting new file mode 100644 index 000000000000..a492233931b9 --- /dev/null +++ b/Documentation/arm/Porting @@ -0,0 +1,135 @@ +Taken from list archive at http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2001-July/004064.html + +Initial definitions +------------------- + +The following symbol definitions rely on you knowing the translation that +__virt_to_phys() does for your machine. This macro converts the passed +virtual address to a physical address. 
Normally, it is simply: + + phys = virt - PAGE_OFFSET + PHYS_OFFSET + + +Decompressor Symbols +-------------------- + +ZTEXTADDR + Start address of decompressor. There's no point in talking about + virtual or physical addresses here, since the MMU will be off at + the time when you call the decompressor code. You normally call + the kernel at this address to start it booting. This doesn't have + to be located in RAM, it can be in flash or other read-only or + read-write addressable medium. + +ZBSSADDR + Start address of zero-initialised work area for the decompressor. + This must be pointing at RAM. The decompressor will zero initialise + this for you. Again, the MMU will be off. + +ZRELADDR + This is the address where the decompressed kernel will be written, + and eventually executed. The following constraint must be valid: + + __virt_to_phys(TEXTADDR) == ZRELADDR + + The initial part of the kernel is carefully coded to be position + independent. + +INITRD_PHYS + Physical address to place the initial RAM disk. Only relevant if + you are using the bootpImage stuff (which only works on the old + struct param_struct). + +INITRD_VIRT + Virtual address of the initial RAM disk. The following constraint + must be valid: + + __virt_to_phys(INITRD_VIRT) == INITRD_PHYS + +PARAMS_PHYS + Physical address of the struct param_struct or tag list, giving the + kernel various parameters about its execution environment. + + +Kernel Symbols +-------------- + +PHYS_OFFSET + Physical start address of the first bank of RAM. + +PAGE_OFFSET + Virtual start address of the first bank of RAM. During the kernel + boot phase, virtual address PAGE_OFFSET will be mapped to physical + address PHYS_OFFSET, along with any other mappings you supply. + This should be the same value as TASK_SIZE. + +TASK_SIZE + The maximum size of a user process in bytes. Since user space + always starts at zero, this is the maximum address that a user + process can access+1. 
The user space stack grows down from this + address. + + Any virtual address below TASK_SIZE is deemed to be user process + area, and therefore managed dynamically on a process by process + basis by the kernel. I'll call this the user segment. + + Anything above TASK_SIZE is common to all processes. I'll call + this the kernel segment. + + (In other words, you can't put IO mappings below TASK_SIZE, and + hence PAGE_OFFSET). + +TEXTADDR + Virtual start address of kernel, normally PAGE_OFFSET + 0x8000. + This is where the kernel image ends up. With the latest kernels, + it must be located at 32768 bytes into a 128MB region. Previous + kernels placed a restriction of 256MB here. + +DATAADDR + Virtual address for the kernel data segment. Must not be defined + when using the decompressor. + +VMALLOC_START +VMALLOC_END + Virtual addresses bounding the vmalloc() area. There must not be + any static mappings in this area; vmalloc will overwrite them. + The addresses must also be in the kernel segment (see above). + Normally, the vmalloc() area starts VMALLOC_OFFSET bytes above the + last virtual RAM address (found using variable high_memory). + +VMALLOC_OFFSET + Offset normally incorporated into VMALLOC_START to provide a hole + between virtual RAM and the vmalloc area. We do this to allow + out of bounds memory accesses (eg, something writing off the end + of the mapped memory map) to be caught. Normally set to 8MB. + +Architecture Specific Macros +---------------------------- + +BOOT_MEM(pram,pio,vio) + `pram' specifies the physical start address of RAM. Must always + be present, and should be the same as PHYS_OFFSET. + + `pio' is the physical address of an 8MB region containing IO for + use with the debugging macros in arch/arm/kernel/debug-armv.S. + + `vio' is the virtual address of the 8MB debugging region. + + It is expected that the debugging region will be re-initialised + by the architecture specific code later in the code (via the + MAPIO function). 
+ +BOOT_PARAMS + Same as, and see PARAMS_PHYS. + +FIXUP(func) + Machine specific fixups, run before memory subsystems have been + initialised. + +MAPIO(func) + Machine specific function to map IO areas (including the debug + region above). + +INITIRQ(func) + Machine specific function to initialise interrupts. + diff --git a/Documentation/arm/README b/Documentation/arm/README new file mode 100644 index 000000000000..a6f718e90a86 --- /dev/null +++ b/Documentation/arm/README @@ -0,0 +1,198 @@ + ARM Linux 2.6 + ============= + + Please check for + updates. + +Compilation of kernel +--------------------- + + In order to compile ARM Linux, you will need a compiler capable of + generating ARM ELF code with GNU extensions. GCC 2.95.1, EGCS + 1.1.2, and GCC 3.3 are known to be good compilers. Fortunately, you + needn't guess. The kernel will report an error if your compiler is + a recognized offender. + + To build ARM Linux natively, you shouldn't have to alter the ARCH = line + in the top level Makefile. However, if you don't have the ARM Linux ELF + tools installed as default, then you should change the CROSS_COMPILE + line as detailed below. + + If you wish to cross-compile, then alter the following lines in the top + level make file: + + ARCH = + with + ARCH = arm + + and + + CROSS_COMPILE= + to + CROSS_COMPILE= + eg. + CROSS_COMPILE=arm-linux- + + Do a 'make config', followed by 'make Image' to build the kernel + (arch/arm/boot/Image). A compressed image can be built by doing a + 'make zImage' instead of 'make Image'. + + +Bug reports etc +--------------- + + Please send patches to the patch system. For more information, see + http://www.arm.linux.org.uk/patches/info.html Always include some + explanation as to what the patch does and why it is needed. 
+ + Bug reports should be sent to linux-arm-kernel@lists.arm.linux.org.uk, + or submitted through the web form at + http://www.arm.linux.org.uk/forms/solution.shtml + + When sending bug reports, please ensure that they contain all relevant + information, eg. the kernel messages that were printed before/during + the problem, what you were doing, etc. + + +Include files +------------- + + Several new include directories have been created under include/asm-arm, + which are there to reduce the clutter in the top-level directory. These + directories and their purposes are listed below: + + arch-* machine/platform specific header files + hardware driver-internal ARM specific data structures/definitions + mach descriptions of generic ARM to specific machine interfaces + proc-* processor dependent header files (currently only two + categories) + + +Machine/Platform support +------------------------ + + The ARM tree contains support for a lot of different machine types. To + continue supporting these differences, it has become necessary to split + machine-specific parts by directory. For this, the machine category is + used to select which directories and files get included (we will use + $(MACHINE) to refer to the category) + + To this end, we now have arch/arm/mach-$(MACHINE) directories which are + designed to house the non-driver files for a particular machine (eg, PCI, + memory management, architecture definitions etc). For all future + machines, there should be a corresponding include/asm-arm/arch-$(MACHINE) + directory. + + +Modules +------- + + Although modularisation is supported (and required for the FP emulator), + each module on an ARM2/ARM250/ARM3 machine will, when loaded, take + memory up to the next 32k boundary due to the size of the pages. + Therefore, is modularisation on these machines really worth it? 
+ + However, ARM6 and up machines allow modules to take multiples of 4k, and + as such Acorn RiscPCs and other architectures using these processors can + make good use of modularisation. + + +ADFS Image files +---------------- + + You can access image files on your ADFS partitions by mounting the ADFS + partition, and then using the loopback device driver. You must have + losetup installed. + + Please note that the PCEmulator DOS partitions have a partition table at + the start, and as such, you will have to give '-o offset' to losetup. + + +Request to developers +--------------------- + + When writing device drivers which include a separate assembler file, please + include it in with the C file, and not the arch/arm/lib directory. This + allows the driver to be compiled as a loadable module without requiring + half the code to be compiled into the kernel image. + + In general, try to avoid using assembler unless it is really necessary. It + makes drivers far less easy to port to other hardware. + + +ST506 hard drives +----------------- + + The ST506 hard drive controllers seem to be working fine (if a little + slowly). At the moment they will only work off the controllers on an + A4x0's motherboard, but for it to work off a Podule just requires + someone with a podule to add the addresses for the IRQ mask and the + HDC base to the source. + + As of 31/3/96 it works with two drives (you should get the ADFS + *configure harddrive set to 2). I've got an internal 20MB and a great + big external 5.25" FH 64MB drive (who could ever want more :-) ). + + I've just got 240K/s off it (a dd with bs=128k); that's about half of what + RiscOS gets; but it's a heck of a lot better than the 50K/s I was getting + last week :-) + + Known bug: Drive data errors can cause a hang; including cases where + the controller has fixed the error using ECC. (Possibly ONLY + in that case...hmm). + + +1772 Floppy +----------- + This also seems to work OK, but hasn't been stressed much lately. 
It + hasn't got any code for disc change detection in there at the moment which + could be a bit of a problem! Suggestions on the correct way to do this + are welcome. + + +CONFIG_MACH_ and CONFIG_ARCH_ +----------------------------- + A change was made in 2003 to the macro names for new machines. + Historically, CONFIG_ARCH_ was used for the bona fide architecture, + e.g. SA1100, as well as implementations of the architecture, + e.g. Assabet. It was decided to change the implementation macros + to read CONFIG_MACH_ for clarity. Moreover, a retroactive fixup has + not been made because it would complicate patching. + + Previous registrations may be found online. + + + +Kernel entry (head.S) +-------------------------- + The initial entry into the kernel is via head.S, which uses machine + independent code. The machine is selected by the value of 'r1' on + entry, which must be kept unique. + + Due to the large number of machines which the ARM port of Linux provides + for, we have a method to manage this which ensures that we don't end up + duplicating large amounts of code. + + We group machine (or platform) support code into machine classes. A + class is typically based around one or more system on a chip devices, and + acts as a natural container around the actual implementations. These + classes are given directories - arch/arm/mach- and + include/asm-arm/arch- - which contain the source files to + support the machine class. These directories also contain any machine + specific supporting code. + + For example, the SA1100 class is based upon the SA1100 and SA1110 SoC + devices, and contains the code to support the way the on-board and off- + board devices are used, or the device is setup, and provides that + machine specific "personality." + + This fine-grained machine specific selection is controlled by the machine + type ID, which acts both as a run-time and a compile-time code selection + method. 
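The run-time half of that selection could be sketched, under stated assumptions, as a table lookup keyed on the machine type ID passed in 'r1'. This is only an illustration, not the code in head.S: the structure, the function name, the ID values and the board names below are all hypothetical, and real IDs come from the machine registry.

```c
#include <stddef.h>

/* Hypothetical descriptor: one entry per supported machine. */
struct demo_machine_desc {
	unsigned int nr;	/* machine type ID expected in r1 */
	const char *name;	/* human-readable machine name */
};

/* Made-up IDs and names, for illustration only. */
const struct demo_machine_desc demo_machines[] = {
	{ 1001, "demo-board-a" },
	{ 1002, "demo-board-b" },
};

/* Returns the descriptor whose ID matches r1, or NULL if unsupported. */
const struct demo_machine_desc *demo_lookup_machine(unsigned int r1)
{
	size_t i;

	for (i = 0; i < sizeof(demo_machines) / sizeof(demo_machines[0]); i++) {
		if (demo_machines[i].nr == r1)
			return &demo_machines[i];
	}
	return NULL;
}
```

A lookup that returns NULL corresponds to a kernel that was not built with support for the machine the boot loader reported, which is why the IDs have to be kept unique.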
+ + You can register a new machine via the web site at: + + + +--- +Russell King (15/03/2004) diff --git a/Documentation/arm/SA1100/ADSBitsy b/Documentation/arm/SA1100/ADSBitsy new file mode 100644 index 000000000000..ab47c3833908 --- /dev/null +++ b/Documentation/arm/SA1100/ADSBitsy @@ -0,0 +1,43 @@ +ADS Bitsy Single Board Computer +(It is different from Bitsy(iPAQ) of Compaq) + +For more details, contact Applied Data Systems or see +http://www.applieddata.net/products.html + +The Linux support for this product has been provided by +Woojung Huh + +Use 'make adsbitsy_config' before any 'make config'. +This will set up defaults for ADS Bitsy support. + +The kernel zImage is linked to be loaded and executed at 0xc0400000. + +Linux can be used with the ADS BootLoader that ships with the +newer rev boards. See their documentation on how to load Linux. + +Supported peripherals: +- SA1100 LCD frame buffer (8/16bpp...sort of) +- SA1111 USB Master +- SA1100 serial port +- pcmcia, compact flash +- touchscreen(ucb1200) +- console on LCD screen +- serial ports (ttyS[0-2]) + - ttyS0 is default for serial console + +To do: +- everything else! :-) + +Notes: + +- The flash on board is divided into 3 partitions. + Be careful when using the on-board flash. + Its partitioning differs from that of the GraphicsClient Plus and GraphicsMaster. + +- 16bpp mode requires a different cable than what ships with the board. + Contact ADS or look through the manual to wire your own. Currently, + if you compile with 16bit mode support and switch into a lower bpp + mode, the timing is off so the image is corrupted. This will be + fixed soon. + +Any contribution can be sent to nico@cam.org and will be greatly welcome!
diff --git a/Documentation/arm/SA1100/Assabet b/Documentation/arm/SA1100/Assabet new file mode 100644 index 000000000000..cbbe5587c78d --- /dev/null +++ b/Documentation/arm/SA1100/Assabet @@ -0,0 +1,301 @@ +The Intel Assabet (SA-1110 evaluation) board +============================================ + +Please see: +http://developer.intel.com/design/strong/quicklist/eval-plat/sa-1110.htm +http://developer.intel.com/design/strong/guides/278278.htm + +Also some notes from John G Dorsey : +http://www.cs.cmu.edu/~wearable/software/assabet.html + + +Building the kernel +------------------- + +To build the kernel with current defaults: + + make assabet_config + make oldconfig + make zImage + +The resulting kernel image should be available in linux/arch/arm/boot/zImage. + + +Installing a bootloader +----------------------- + +A couple of bootloaders able to boot Linux on Assabet are available: + +BLOB (http://www.lart.tudelft.nl/lartware/blob/) + + BLOB is a bootloader used within the LART project. Some contributed + patches were merged into BLOB to add support for Assabet. + +Compaq's Bootldr + John Dorsey's patch for Assabet support +(http://www.handhelds.org/Compaq/bootldr.html) +(http://www.wearablegroup.org/software/bootldr/) + + Bootldr is the bootloader developed by Compaq for the iPAQ Pocket PC. + John Dorsey has produced add-on patches to add support for Assabet and + the JFFS filesystem. + +RedBoot (http://sources.redhat.com/redboot/) + + RedBoot is a bootloader developed by Red Hat based on the eCos RTOS + hardware abstraction layer. It supports Assabet amongst many other + hardware platforms. + +RedBoot is currently the recommended choice since it's the only one to have +networking support, and is the most actively maintained. + +Brief examples on how to boot Linux with RedBoot are shown below. But first +you need to have RedBoot installed in your flash memory. 
A known to work +precompiled RedBoot binary is available from the following location: + +ftp://ftp.netwinder.org/users/n/nico/ +ftp://ftp.arm.linux.org.uk/pub/linux/arm/people/nico/ +ftp://ftp.handhelds.org/pub/linux/arm/sa-1100-patches/ + +Look for redboot-assabet*.tgz. Some installation infos are provided in +redboot-assabet*.txt. + + +Initial RedBoot configuration +----------------------------- + +The commands used here are explained in The RedBoot User's Guide available +on-line at http://sources.redhat.com/ecos/docs-latest/redboot/redboot.html. +Please refer to it for explanations. + +If you have a CF network card (my Assabet kit contained a CF+ LP-E from +Socket Communications Inc.), you should strongly consider using it for TFTP +file transfers. You must insert it before RedBoot runs since it can't detect +it dynamically. + +To initialize the flash directory: + + fis init -f + +To initialize the non-volatile settings, like whether you want to use BOOTP or +a static IP address, etc, use this command: + + fconfig -i + + +Writing a kernel image into flash +--------------------------------- + +First, the kernel image must be loaded into RAM. If you have the zImage file +available on a TFTP server: + + load zImage -r -b 0x100000 + +If you rather want to use Y-Modem upload over the serial port: + + load -m ymodem -r -b 0x100000 + +To write it to flash: + + fis create "Linux kernel" -b 0x100000 -l 0xc0000 + + +Booting the kernel +------------------ + +The kernel still requires a filesystem to boot. A ramdisk image can be loaded +as follows: + + load ramdisk_image.gz -r -b 0x800000 + +Again, Y-Modem upload can be used instead of TFTP by replacing the file name +by '-y ymodem'. + +Now the kernel can be retrieved from flash like this: + + fis load "Linux kernel" + +or loaded as described previously. 
To boot the kernel: + + exec -b 0x100000 -l 0xc0000 + +The ramdisk image could be stored into flash as well, but there are better +solutions for on-flash filesystems as mentioned below. + + +Using JFFS2 +----------- + +Using JFFS2 (the Second Journalling Flash File System) is probably the most +convenient way to store a writable filesystem into flash. JFFS2 is used in +conjunction with the MTD layer which is responsible for low-level flash +management. More information on the Linux MTD can be found on-line at: +http://www.linux-mtd.infradead.org/. A JFFS howto with some infos about +creating JFFS/JFFS2 images is available from the same site. + +For instance, a sample JFFS2 image can be retrieved from the same FTP sites +mentioned below for the precompiled RedBoot image. + +To load this file: + + load sample_img.jffs2 -r -b 0x100000 + +The result should look like: + +RedBoot> load sample_img.jffs2 -r -b 0x100000 +Raw file loaded 0x00100000-0x00377424 + +Now we must know the size of the unallocated flash: + + fis free + +Result: + +RedBoot> fis free + 0x500E0000 .. 0x503C0000 + +The values above may be different depending on the size of the filesystem and +the type of flash. See their usage below as an example and take care of +substituting yours appropriately. + +We must determine some values: + +size of unallocated flash: 0x503c0000 - 0x500e0000 = 0x2e0000 +size of the filesystem image: 0x00377424 - 0x00100000 = 0x277424 + +We want to fit the filesystem image of course, but we also want to give it all +the remaining flash space as well. To write it: + + fis unlock -f 0x500E0000 -l 0x2e0000 + fis erase -f 0x500E0000 -l 0x2e0000 + fis write -b 0x100000 -l 0x277424 -f 0x500E0000 + fis create "JFFS2" -n -f 0x500E0000 -l 0x2e0000 + +Now the filesystem is associated to a MTD "partition" once Linux has discovered +what they are in the boot process. 
From RedBoot, the 'fis list' command +displays them: + +RedBoot> fis list +Name FLASH addr Mem addr Length Entry point +RedBoot 0x50000000 0x50000000 0x00020000 0x00000000 +RedBoot config 0x503C0000 0x503C0000 0x00020000 0x00000000 +FIS directory 0x503E0000 0x503E0000 0x00020000 0x00000000 +Linux kernel 0x50020000 0x00100000 0x000C0000 0x00000000 +JFFS2 0x500E0000 0x500E0000 0x002E0000 0x00000000 + +However Linux should display something like: + +SA1100 flash: probing 32-bit flash bus +SA1100 flash: Found 2 x16 devices at 0x0 in 32-bit mode +Using RedBoot partition definition +Creating 5 MTD partitions on "SA1100 flash": +0x00000000-0x00020000 : "RedBoot" +0x00020000-0x000e0000 : "Linux kernel" +0x000e0000-0x003c0000 : "JFFS2" +0x003c0000-0x003e0000 : "RedBoot config" +0x003e0000-0x00400000 : "FIS directory" + +What's important here is the position of the partition we are interested in, +which is the third one. Within Linux, this corresponds to /dev/mtdblock2. +Therefore to boot Linux with the kernel and its root filesystem in flash, we +need these RedBoot commands: + + fis load "Linux kernel" + exec -b 0x100000 -l 0xc0000 -c "root=/dev/mtdblock2" + +Of course filesystems other than JFFS2 might be used, like cramfs for example. +You might want to boot with a root filesystem over NFS, etc. It is also +possible, and sometimes more convenient, to flash a filesystem directly from +within Linux while booted from a ramdisk or NFS. The Linux MTD repository has +many tools to deal with flash memory as well, to erase it for example. JFFS2 +can then be mounted directly on a freshly erased partition and files can be +copied over directly. Etc... + + +RedBoot scripting +----------------- + +All the commands above aren't so useful if they have to be typed in every +time the Assabet is rebooted. Therefore it's possible to automate the boot +process using RedBoot's scripting capability.
+ +For example, I use this to boot Linux with both the kernel and the ramdisk +images retrieved from a TFTP server on the network: + +RedBoot> fconfig +Run script at boot: false true +Boot script: +Enter script, terminate with empty line +>> load zImage -r -b 0x100000 +>> load ramdisk_ks.gz -r -b 0x800000 +>> exec -b 0x100000 -l 0xc0000 +>> +Boot script timeout (1000ms resolution): 3 +Use BOOTP for network configuration: true +GDB connection port: 9000 +Network debug at boot time: false +Update RedBoot non-volatile configuration - are you sure (y/n)? y + +Then, rebooting the Assabet is just a matter of waiting for the login prompt. + + + +Nicolas Pitre +nico@cam.org +June 12, 2001 + + +Status of peripherals in -rmk tree (updated 14/10/2001) +------------------------------------------------------- + +Assabet: + Serial ports: + Radio: TX, RX, CTS, DSR, DCD, RI + PM: Not tested. + COM: TX, RX, CTS, DSR, DCD, RTS, DTR, PM + PM: Not tested. + I2C: Implemented, not fully tested. + L3: Fully tested, pass. + PM: Not tested. + + Video: + LCD: Fully tested. PM + (LCD doesn't like being blanked with + neponset connected) + Video out: Not fully + + Audio: + UDA1341: + Playback: Fully tested, pass. + Record: Implemented, not tested. + PM: Not tested. + + UCB1200: + Audio play: Implemented, not heavily tested. + Audio rec: Implemented, not heavily tested. + Telco audio play: Implemented, not heavily tested. + Telco audio rec: Implemented, not heavily tested. + POTS control: No + Touchscreen: Yes + PM: Not tested. + + Other: + PCMCIA: + LPE: Fully tested, pass. + USB: No + IRDA: + SIR: Fully tested, pass. + FIR: Fully tested, pass. + PM: Not tested. + +Neponset: + Serial ports: + COM1,2: TX, RX, CTS, DSR, DCD, RTS, DTR + PM: Not tested. + USB: Implemented, not heavily tested. + PCMCIA: Implemented, not heavily tested. + PM: Not tested. + CF: Implemented, not heavily tested. + PM: Not tested. + +More stuff can be found in the -np (Nicolas Pitre's) tree. 
+ diff --git a/Documentation/arm/SA1100/Brutus b/Documentation/arm/SA1100/Brutus new file mode 100644 index 000000000000..2254c8f0b326 --- /dev/null +++ b/Documentation/arm/SA1100/Brutus @@ -0,0 +1,66 @@ +Brutus is an evaluation platform for the SA1100 manufactured by Intel. +For more details, see: + +http://developer.intel.com/design/strong/applnots/sa1100lx/getstart.htm + +To compile for Brutus, you must issue the following commands: + + make brutus_config + make config + [accept all the defaults] + make zImage + +The resulting kernel will end up in linux/arch/arm/boot/zImage. This file +must be loaded at 0xc0008000 in Brutus's memory and execution started at +0xc0008000 as well with the value of registers r0 = 0 and r1 = 16 upon +entry. + +But prior to execute the kernel, a ramdisk image must also be loaded in +memory. Use memory address 0xd8000000 for this. Note that the file +containing the (compressed) ramdisk image must not exceed 4 MB. + +Typically, you'll need angelboot to load the kernel. +The following angelboot.opt file should be used: + +----- begin angelboot.opt ----- +base 0xc0008000 +entry 0xc0008000 +r0 0x00000000 +r1 0x00000010 +device /dev/ttyS0 +options "9600 8N1" +baud 115200 +otherfile ramdisk_img.gz +otherbase 0xd8000000 +----- end angelboot.opt ----- + +Then load the kernel and ramdisk with: + + angelboot -f angelboot.opt zImage + +The first Brutus serial port (assumed to be linked to /dev/ttyS0 on your +host PC) is used by angel to load the kernel and ramdisk image. The serial +console is provided through the second Brutus serial port. To access it, +you may use minicom configured with /dev/ttyS1, 9600 baud, 8N1, no flow +control. + +Currently supported: + - RS232 serial ports + - audio output + - LCD screen + - keyboard + +The actual Brutus support may not be complete without extra patches. +If such patches exist, they should be found from +ftp.netwinder.org/users/n/nico. 
+ +Full PCMCIA support is still missing, although it's possible to hack +some drivers in order to drive already inserted cards at boot time with +few modifications. + +Any contribution is welcome. + +Please send patches to nico@cam.org + +Have Fun ! + diff --git a/Documentation/arm/SA1100/CERF b/Documentation/arm/SA1100/CERF new file mode 100644 index 000000000000..b3d845301ef1 --- /dev/null +++ b/Documentation/arm/SA1100/CERF @@ -0,0 +1,29 @@ +*** The StrongARM version of the CerfBoard/Cube has been discontinued *** + +The Intrinsyc CerfBoard is a StrongARM 1110-based computer on a board +that measures approximately 2" square. It includes an Ethernet +controller, an RS232-compatible serial port, a USB function port, and +one CompactFlash+ slot on the back. Pictures can be found at the +Intrinsyc website, http://www.intrinsyc.com. + +This document describes the support in the Linux kernel for the +Intrinsyc CerfBoard. + +Supported in this version: + - CompactFlash+ slot (select PCMCIA in General Setup and any options + that may be required) + - Onboard Crystal CS8900 Ethernet controller (Cerf CS8900A support in + Network Devices) + - Serial ports with a serial console (hardcoded to 38400 8N1) + +In order to get this kernel onto your Cerf, you need a server that runs +both BOOTP and TFTP. Detailed instructions should have come with your +evaluation kit on how to use the bootloader. This series of commands +will suffice: + + make ARCH=arm CROSS_COMPILE=arm-linux- cerfcube_defconfig + make ARCH=arm CROSS_COMPILE=arm-linux- zImage + make ARCH=arm CROSS_COMPILE=arm-linux- modules + cp arch/arm/boot/zImage + +support@intrinsyc.com diff --git a/Documentation/arm/SA1100/FreeBird b/Documentation/arm/SA1100/FreeBird new file mode 100644 index 000000000000..eda28b3232e7 --- /dev/null +++ b/Documentation/arm/SA1100/FreeBird @@ -0,0 +1,21 @@ +Freebird-1.1 is produced by Legend(C), Inc. +(http://www.legend.com.cn) +and software/linux maintained by Coventive(C), Inc.
+(http://www.coventive.com) + +Based on the Nicolas's strongarm kernel tree. + +=============================================================== +Maintainer: + +Chester Kuo + + +Author : +Tim wu +CIH +Eric Peng +Jeff Lee +Allen Cheng +Tony Liu + diff --git a/Documentation/arm/SA1100/GraphicsClient b/Documentation/arm/SA1100/GraphicsClient new file mode 100644 index 000000000000..8fa7e8027ff1 --- /dev/null +++ b/Documentation/arm/SA1100/GraphicsClient @@ -0,0 +1,98 @@ +ADS GraphicsClient Plus Single Board Computer + +For more details, contact Applied Data Systems or see +http://www.applieddata.net/products.html + +The original Linux support for this product has been provided by +Nicolas Pitre . Continued development work by +Woojung Huh + +It's currently possible to mount a root filesystem via NFS providing a +complete Linux environment. Otherwise a ramdisk image may be used. The +board supports MTD/JFFS, so you could also mount something on there. + +Use 'make graphicsclient_config' before any 'make config'. This will set up +defaults for GraphicsClient Plus support. + +The kernel zImage is linked to be loaded and executed at 0xc0200000. +Also the following registers should have the specified values upon entry: + + r0 = 0 + r1 = 29 (this is the GraphicsClient architecture number) + +Linux can be used with the ADS BootLoader that ships with the +newer rev boards. See their documentation on how to load Linux. +Angel is not available for the GraphicsClient Plus AFAIK. + +There is a board known as just the GraphicsClient that ADS used to +produce but has end of lifed. This code will not work on the older +board with the ADS bootloader, but should still work with Angel, +as outlined below. In any case, if you're planning on deploying +something en masse, you should probably get the newer board. 
+ +If using Angel on the older boards, here is a typical angelboot.opt option file +if the kernel is loaded through the Angel Debug Monitor: + +----- begin angelboot.opt ----- +base 0xc0200000 +entry 0xc0200000 +r0 0x00000000 +r1 0x0000001d +device /dev/ttyS1 +options "38400 8N1" +baud 115200 +#otherfile ramdisk.gz +#otherbase 0xc0800000 +exec minicom +----- end angelboot.opt ----- + +Then the kernel (and ramdisk if otherfile/otherbase lines above are +uncommented) would be loaded with: + + angelboot -f angelboot.opt zImage + +Here it is assumed that the board is connected to ttyS1 on your PC +and that minicom is preconfigured with /dev/ttyS1, 38400 baud, 8N1, no flow +control by default. + +If any other bootloader is used, ensure it accomplishes the same, especially +for r0/r1 register values before jumping into the kernel. + + +Supported peripherals: +- SA1100 LCD frame buffer (8/16bpp...sort of) +- on-board SMC 92C96 ethernet NIC +- SA1100 serial port +- flash memory access (MTD/JFFS) +- pcmcia +- touchscreen(ucb1200) +- ps/2 keyboard +- console on LCD screen +- serial ports (ttyS[0-2]) + - ttyS0 is default for serial console +- Smart I/O (ADC, keypad, digital inputs, etc) + See http://www.applieddata.com/developers/linux for IOCTL documentation + and example user space code. ps/2 keybd is multiplexed through this driver + +To do: +- UCB1200 audio with new ucb_generic layer +- everything else! :-) + +Notes: + +- The flash on board is divided into 3 partitions. mtd0 is where + the ADS boot ROM and zImage are stored. It's been marked as + read-only to keep you from blasting over the bootloader. :) mtd1 is + for the ramdisk.gz image. mtd2 is user flash space and can be + utilized for either JFFS or if you're feeling crazy, running ext2 + on top of it. If you're not using the ADS bootloader, you're + welcome to blast over the mtd1 partition also. + +- 16bpp mode requires a different cable than what ships with the board.
+ Contact ADS or look through the manual to wire your own. Currently, + if you compile with 16bit mode support and switch into a lower bpp + mode, the timing is off so the image is corrupted. This will be + fixed soon. + +Any contribution can be sent to nico@cam.org and will be greatly welcome! + diff --git a/Documentation/arm/SA1100/GraphicsMaster b/Documentation/arm/SA1100/GraphicsMaster new file mode 100644 index 000000000000..dd28745ac521 --- /dev/null +++ b/Documentation/arm/SA1100/GraphicsMaster @@ -0,0 +1,53 @@ +ADS GraphicsMaster Single Board Computer + +For more details, contact Applied Data Systems or see +http://www.applieddata.net/products.html + +The original Linux support for this product has been provided by +Nicolas Pitre . Continued development work by +Woojung Huh + +Use 'make graphicsmaster_config' before any 'make config'. +This will set up defaults for GraphicsMaster support. + +The kernel zImage is linked to be loaded and executed at 0xc0400000. + +Linux can be used with the ADS BootLoader that ships with the +newer rev boards. See their documentation on how to load Linux. + +Supported peripherals: +- SA1100 LCD frame buffer (8/16bpp...sort of) +- SA1111 USB Master +- on-board SMC 92C96 ethernet NIC +- SA1100 serial port +- flash memory access (MTD/JFFS) +- pcmcia, compact flash +- touchscreen(ucb1200) +- ps/2 keyboard +- console on LCD screen +- serial ports (ttyS[0-2]) + - ttyS0 is default for serial console +- Smart I/O (ADC, keypad, digital inputs, etc) + See http://www.applieddata.com/developers/linux for IOCTL documentation + and example user space code. ps/2 keybd is multiplexed through this driver + +To do: +- everything else! :-) + +Notes: + +- The flash on board is divided into 3 partitions. mtd0 is where + the zImage is stored. It's been marked as read-only to keep you + from blasting over the bootloader. :) mtd1 is + for the ramdisk.gz image. 
mtd2 is user flash space and can be + utilized for either JFFS or if you're feeling crazy, running ext2 + on top of it. If you're not using the ADS bootloader, you're + welcome to blast over the mtd1 partition also. + +- 16bpp mode requires a different cable than what ships with the board. + Contact ADS or look through the manual to wire your own. Currently, + if you compile with 16bit mode support and switch into a lower bpp + mode, the timing is off so the image is corrupted. This will be + fixed soon. + +Any contribution can be sent to nico@cam.org and will be greatly welcome! diff --git a/Documentation/arm/SA1100/HUW_WEBPANEL b/Documentation/arm/SA1100/HUW_WEBPANEL new file mode 100644 index 000000000000..fd56b48d4833 --- /dev/null +++ b/Documentation/arm/SA1100/HUW_WEBPANEL @@ -0,0 +1,17 @@ +The HUW_WEBPANEL is a product of the German company Hoeft & Wessel AG + +If you want more information, please visit +http://www.hoeft-wessel.de + +To build the kernel: + make huw_webpanel_config + make oldconfig + [accept all defaults] + make zImage + +Most of the work was done by: +Roman Jordan jor@hoeft-wessel.de +Christoph Schulz schu@hoeft-wessel.de + +2000/12/18/ + diff --git a/Documentation/arm/SA1100/Itsy b/Documentation/arm/SA1100/Itsy new file mode 100644 index 000000000000..3b594534323b --- /dev/null +++ b/Documentation/arm/SA1100/Itsy @@ -0,0 +1,39 @@ +Itsy is a research project done by the Western Research Lab, and Systems +Research Center in Palo Alto, CA. The Itsy project is one of several +research projects at Compaq that are related to pocket computing. + +For more information, see: + + http://www.research.digital.com/wrl/itsy/index.html + +Notes on initial 2.4 Itsy support (8/27/2000) : +The port was done on an Itsy version 1.5 machine with a daughtercard with +64 Meg of DRAM and 32 Meg of Flash. The initial work includes support for +serial console (to see what you're doing). No other devices have been +enabled.
+ +To build, do a "make menuconfig" (or xmenuconfig) and select Itsy support. +Disable Flash and LCD support. and then do a make zImage. +Finally, you will need to cd to arch/arm/boot/tools and execute a make there +to build the params-itsy program used to boot the kernel. + +In order to install the port of 2.4 to the itsy, You will need to set the +configuration parameters in the monitor as follows: +Arg 1:0x08340000, Arg2: 0xC0000000, Arg3:18 (0x12), Arg4:0 +Make sure the start-routine address is set to 0x00060000. + +Next, flash the params-itsy program to 0x00060000 ("p 1 0x00060000" in the +flash menu) Flash the kernel in arch/arm/boot/zImage into 0x08340000 +("p 1 0x00340000"). Finally flash an initial ramdisk into 0xC8000000 +("p 2 0x0") We used ramdisk-2-30.gz from the 0.11 version directory on +handhelds.org. + +The serial connection we established was at: + 8-bit data, no parity, 1 stop bit(s), 115200.00 b/s. in the monitor, in the +params-itsy program, and in the kernel itself. This can be changed, but +not easily. The monitor parameters are easily changed, the params program +setup is assembly outl's, and the kernel is a configuration item specific to +the itsy. (i.e. grep for CONFIG_SA1100_ITSY and you'll find where it is.) + + +This should get you a properly booting 2.4 kernel on the itsy. diff --git a/Documentation/arm/SA1100/LART b/Documentation/arm/SA1100/LART new file mode 100644 index 000000000000..2f73f513e16a --- /dev/null +++ b/Documentation/arm/SA1100/LART @@ -0,0 +1,14 @@ +Linux Advanced Radio Terminal (LART) +------------------------------------ + +The LART is a small (7.5 x 10cm) SA-1100 board, designed for embedded +applications. It has 32 MB DRAM, 4MB Flash ROM, double RS232 and all +other StrongARM-gadgets. Almost all SA signals are directly accessible +through a number of connectors. The powersupply accepts voltages +between 3.5V and 16V and is overdimensioned to support a range of +daughterboards. 
A quad Ethernet / IDE / PS2 / sound daughterboard +is under development, with plenty of others in different stages of +planning. + +The hardware designs for this board have been released under an open license; +see the LART page at http://www.lart.tudelft.nl/ for more information. diff --git a/Documentation/arm/SA1100/PLEB b/Documentation/arm/SA1100/PLEB new file mode 100644 index 000000000000..92cae066908d --- /dev/null +++ b/Documentation/arm/SA1100/PLEB @@ -0,0 +1,11 @@ +The PLEB project was started as a student initiative at the School of +Computer Science and Engineering, University of New South Wales to make a +pocket computer capable of running the Linux Kernel. + +PLEB support has yet to be fully integrated. + +For more information, see: + + http://www.cse.unsw.edu.au/~pleb/ + + diff --git a/Documentation/arm/SA1100/Pangolin b/Documentation/arm/SA1100/Pangolin new file mode 100644 index 000000000000..077a6120e129 --- /dev/null +++ b/Documentation/arm/SA1100/Pangolin @@ -0,0 +1,23 @@ +Pangolin is a StrongARM 1110-based evaluation platform produced +by Dialogue Technology (http://www.dialogue.com.tw/). +It has EISA slots for ease of configuration with SDRAM/Flash +memory card, USB/Serial/Audio card, Compact Flash card, +PCMCIA/IDE card and TFT-LCD card. + +To compile for Pangolin, you must issue the following commands: + + make pangolin_config + make oldconfig + make zImage + +Supported peripherals: +- SA1110 serial port (UART1/UART2/UART3) +- flash memory access +- compact flash driver +- UDA1341 sound driver +- SA1100 LCD controller for 800x600 16bpp TFT-LCD +- MQ-200 driver for 800x600 16bpp TFT-LCD +- Penmount(touch panel) driver +- PCMCIA driver +- SMC91C94 LAN driver +- IDE driver (experimental) diff --git a/Documentation/arm/SA1100/Tifon b/Documentation/arm/SA1100/Tifon new file mode 100644 index 000000000000..dd1934d9c851 --- /dev/null +++ b/Documentation/arm/SA1100/Tifon @@ -0,0 +1,7 @@ +Tifon +----- + +More info has to come... 
+ +Contact: Peter Danielsson + diff --git a/Documentation/arm/SA1100/Victor b/Documentation/arm/SA1100/Victor new file mode 100644 index 000000000000..01e81fc49461 --- /dev/null +++ b/Documentation/arm/SA1100/Victor @@ -0,0 +1,16 @@ +Victor is known as a "digital talking book player" manufactured by +VisuAide, Inc. to be used by blind people. + +For more information related to Victor, see: + + http://www.visuaide.com/victor + +Of course Victor is using Linux as its main operating system. +The Victor implementation for Linux is maintained by Nicolas Pitre: + + nico@visuaide.com + nico@cam.org + +For any comments, please feel free to contact me through the above +addresses. + diff --git a/Documentation/arm/SA1100/Yopy b/Documentation/arm/SA1100/Yopy new file mode 100644 index 000000000000..e14f16d836ac --- /dev/null +++ b/Documentation/arm/SA1100/Yopy @@ -0,0 +1,2 @@ +See http://www.yopydeveloper.org for more. + diff --git a/Documentation/arm/SA1100/empeg b/Documentation/arm/SA1100/empeg new file mode 100644 index 000000000000..4ece4849a42c --- /dev/null +++ b/Documentation/arm/SA1100/empeg @@ -0,0 +1,2 @@ +See ../empeg/README + diff --git a/Documentation/arm/SA1100/nanoEngine b/Documentation/arm/SA1100/nanoEngine new file mode 100644 index 000000000000..fc431cbfefc2 --- /dev/null +++ b/Documentation/arm/SA1100/nanoEngine @@ -0,0 +1,11 @@ +nanoEngine +---------- + +"nanoEngine" is a SA1110 based single board computer from +Bright Star Engineering Inc. See www.brightstareng.com/arm +for more info. +(Ref: Stuart Adams ) + +Also visit Larry Doolittle's "Linux for the nanoEngine" site: +http://recycle.lbl.gov/~ldoolitt/bse/ + diff --git a/Documentation/arm/SA1100/serial_UART b/Documentation/arm/SA1100/serial_UART new file mode 100644 index 000000000000..aea2e91ca0ef --- /dev/null +++ b/Documentation/arm/SA1100/serial_UART @@ -0,0 +1,47 @@ +The SA1100 serial port had its major/minor numbers officially assigned: + +> Date: Sun, 24 Sep 2000 21:40:27 -0700 +> From: H. 
Peter Anvin +> To: Nicolas Pitre +> Cc: Device List Maintainer +> Subject: Re: device +> +> Okay. Note that device numbers 204 and 205 are used for "low density +> serial devices", so you will have a range of minors on those majors (the +> tty device layer handles this just fine, so you don't have to worry about +> doing anything special.) +> +> So your assignments are: +> +> 204 char Low-density serial ports +> 5 = /dev/ttySA0 SA1100 builtin serial port 0 +> 6 = /dev/ttySA1 SA1100 builtin serial port 1 +> 7 = /dev/ttySA2 SA1100 builtin serial port 2 +> +> 205 char Low-density serial ports (alternate device) +> 5 = /dev/cusa0 Callout device for ttySA0 +> 6 = /dev/cusa1 Callout device for ttySA1 +> 7 = /dev/cusa2 Callout device for ttySA2 +> + +If you're not using devfs, you must create those inodes in /dev +on the root filesystem used by your SA1100-based device: + + mknod ttySA0 c 204 5 + mknod ttySA1 c 204 6 + mknod ttySA2 c 204 7 + mknod cusa0 c 205 5 + mknod cusa1 c 205 6 + mknod cusa2 c 205 7 + +In addition to the creation of the appropriate device nodes above, you +must ensure your user space applications make use of the correct device +name. The classic example is the content of the /etc/inittab file where +you might have a getty process started on ttyS0. In this case: + +- replace occurrences of ttyS0 with ttySA0, ttyS1 with ttySA1, etc. + +- don't forget to add 'ttySA0', 'console', or the appropriate tty name + in /etc/securetty for root to be allowed to login as well. 
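
As a cross-check of the numbers above, the following minimal sketch builds
the same major/minor pairs with makedev(). It assumes a Linux host with glibc
(for <sys/sysmacros.h>); the SA1100_* macro names and the helper functions
are made up for this example and are not kernel identifiers.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysmacros.h>	/* makedev(), major(), minor() on glibc */

/* Numbers taken from the assignment quoted above. */
#define SA1100_TTY_MAJOR	204	/* /dev/ttySA0..2 */
#define SA1100_CUSA_MAJOR	205	/* /dev/cusa0..2  */
#define SA1100_MINOR_BASE	5	/* minor of port 0 */
#define SA1100_NR_PORTS		3

/* Device number of /dev/ttySA<port>. */
static dev_t sa1100_tty_dev(unsigned int port)
{
	return makedev(SA1100_TTY_MAJOR, SA1100_MINOR_BASE + port);
}

/* Device number of the callout device /dev/cusa<port>. */
static dev_t sa1100_callout_dev(unsigned int port)
{
	return makedev(SA1100_CUSA_MAJOR, SA1100_MINOR_BASE + port);
}

/* Write one mknod command line per device node to 'out'. */
static void sa1100_print_mknod(FILE *out)
{
	unsigned int port;

	for (port = 0; port < SA1100_NR_PORTS; port++)
		fprintf(out, "mknod ttySA%u c %u %u\n", port,
			(unsigned int)major(sa1100_tty_dev(port)),
			(unsigned int)minor(sa1100_tty_dev(port)));

	for (port = 0; port < SA1100_NR_PORTS; port++)
		fprintf(out, "mknod cusa%u c %u %u\n", port,
			(unsigned int)major(sa1100_callout_dev(port)),
			(unsigned int)minor(sa1100_callout_dev(port)));
}
```

Calling sa1100_print_mknod(stdout) from a small main() prints one mknod
line per node, which can be compared against the list given above.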
+ + diff --git a/Documentation/arm/Samsung-S3C24XX/EB2410ITX.txt b/Documentation/arm/Samsung-S3C24XX/EB2410ITX.txt new file mode 100644 index 000000000000..000e3d7a78b2 --- /dev/null +++ b/Documentation/arm/Samsung-S3C24XX/EB2410ITX.txt @@ -0,0 +1,58 @@ + Simtec Electronics EB2410ITX (BAST) + =================================== + + http://www.simtec.co.uk/products/EB2410ITX/ + +Introduction +------------ + + The EB2410ITX is a S3C2410 based development board with a variety of + peripherals and expansion connectors. This board is also known by + the shortened name of Bast. + + +Configuration +------------- + + To set the default configuration, use `make bast_defconfig` which + supports the commonly used features of this board. + + +Support +------- + + Official support information can be found on the Simtec Electronics + website, at the product page http://www.simtec.co.uk/products/EB2410ITX/ + + Useful links: + + - Resources Page http://www.simtec.co.uk/products/EB2410ITX/resources.html + + - Board FAQ at http://www.simtec.co.uk/products/EB2410ITX/faq.html + + - Bootloader info http://www.simtec.co.uk/products/SWABLE/resources.html + and FAQ http://www.simtec.co.uk/products/SWABLE/faq.html + + +MTD +--- + + The NAND and NOR support has been merged from the linux-mtd project. + If you have any problems, see http://www.linux-mtd.infradead.org/ for more + information or up-to-date versions of linux-mtd. + + +IDE +--- + + Both onboard IDE ports are supported. However, there is no support for + changing the speed of devices, so PIO Mode 4 capable drives should be used. + + +Maintainers +----------- + + This board is maintained by Simtec Electronics.
+ + +(c) 2004 Ben Dooks, Simtec Electronics diff --git a/Documentation/arm/Samsung-S3C24XX/GPIO.txt b/Documentation/arm/Samsung-S3C24XX/GPIO.txt new file mode 100644 index 000000000000..0822764ec270 --- /dev/null +++ b/Documentation/arm/Samsung-S3C24XX/GPIO.txt @@ -0,0 +1,122 @@ + S3C2410 GPIO Control + ==================== + +Introduction +------------ + + The s3c2410 kernel provides an interface to configure and + manipulate the state of the GPIO pins, and find out other + information about them. + + There are a number of conditions attached to the configuration + of the s3c2410 GPIO system; please read the Samsung provided + data-sheet/user's manual to find out the complete list. + + +Headers +------- + + See include/asm-arm/arch-s3c2410/regs-gpio.h for the list + of GPIO pins, and the configuration values for them. This + is included by using #include + + The GPIO management functions are defined in the hardware + header include/asm-arm/arch-s3c2410/hardware.h which can be + included by #include + + A useful amount of documentation can be found in the hardware + header on how the GPIO functions (and others) work. + + Whilst a number of these functions do make some checks on what + is passed to them, for speed of use, they may not always ensure + that the data supplied to them is correct. + + +PIN Numbers +----------- + + Each pin has a unique number associated with it in regs-gpio.h, + eg S3C2410_GPA0 or S3C2410_GPF1. These defines are used to tell + the GPIO functions which pin is to be used. + + +Configuring a pin +----------------- + + The following function allows the configuration of a given pin to + be changed. + + void s3c2410_gpio_cfgpin(unsigned int pin, unsigned int function); + + Eg: + + s3c2410_gpio_cfgpin(S3C2410_GPA0, S3C2410_GPA0_ADDR0); + s3c2410_gpio_cfgpin(S3C2410_GPE8, S3C2410_GPE8_SDDAT1); + + which would turn GPA0 into the lowest Address line A0, and set + GPE8 to be connected to the SDIO/MMC controller's SDDAT1 line.
+ + +Reading the current configuration +--------------------------------- + + The current configuration of a pin can be read by using: + + s3c2410_gpio_getcfg(unsigned int pin); + + The return value will be from the same set of values which can be + passed to s3c2410_gpio_cfgpin(). + + +Configuring a pull-up resistor +------------------------------ + + A large proportion of the GPIO pins on the S3C2410 can have weak + pull-up resistors enabled. This can be configured by the following + function: + + void s3c2410_gpio_pullup(unsigned int pin, unsigned int to); + + Where the to value is zero to set the pull-up off, and 1 to enable + the specified pull-up. Any other values are currently undefined. + + +Getting the state of a PIN +-------------------------- + + The state of a pin can be read by using the function: + + unsigned int s3c2410_gpio_getpin(unsigned int pin); + + This will return either zero or non-zero. Do not count on this + function returning 1 if the pin is set. + + +Setting the state of a PIN +-------------------------- + + The value a pin is outputting can be modified by using the following: + + void s3c2410_gpio_setpin(unsigned int pin, unsigned int to); + + Which sets the given pin to the value. Use 0 to write 0, and 1 to + set the output to 1. + + +Getting the IRQ number associated with a PIN +-------------------------------------------- + + The following function can map the given pin number to an IRQ + number to pass to the IRQ system. + + int s3c2410_gpio_getirq(unsigned int pin); + + Note, not all pins have an IRQ. 
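The warning above about s3c2410_gpio_getpin() returning "zero or non-zero" can be illustrated with a small user-space sketch. Everything here is a stand-in for illustration: mock_gpio_setpin(), mock_gpio_getpin(), MOCK_NPINS and the bit-mask behaviour are assumptions made for this example and are not the kernel implementation. The only point carried over from the text is that callers should compare the result against zero rather than against 1.

```c
#include <assert.h>

/*
 * Hypothetical user-space mock, NOT the kernel code. It only mimics the
 * documented contract: getpin returns zero or some non-zero value.
 */
#define MOCK_NPINS 32

unsigned int mock_level[MOCK_NPINS];

/* Stand-in for s3c2410_gpio_setpin(): 0 clears the pin, non-zero sets it. */
void mock_gpio_setpin(unsigned int pin, unsigned int to)
{
	assert(pin < MOCK_NPINS);
	/* Keep a raw bit mask so that a "set" pin is rarely exactly 1. */
	mock_level[pin] = to ? (1u << (pin % 16)) : 0;
}

/* Stand-in for s3c2410_gpio_getpin(): returns zero or non-zero. */
unsigned int mock_gpio_getpin(unsigned int pin)
{
	assert(pin < MOCK_NPINS);
	return mock_level[pin];
}

/* Normalise the "zero or non-zero" result to exactly 0 or 1. */
int pin_is_high(unsigned int pin)
{
	return mock_gpio_getpin(pin) != 0;
}
```

With this mock, a test such as `if (mock_gpio_getpin(5) == 1)` would fail for a set pin, while `if (pin_is_high(5))` behaves as intended. The same comparison style should carry over to callers of the real function.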
+ + +Author +------ + + +Ben Dooks, 03 October 2004 +(c) 2004 Ben Dooks, Simtec Electronics diff --git a/Documentation/arm/Samsung-S3C24XX/H1940.txt b/Documentation/arm/Samsung-S3C24XX/H1940.txt new file mode 100644 index 000000000000..d6b1de92b111 --- /dev/null +++ b/Documentation/arm/Samsung-S3C24XX/H1940.txt @@ -0,0 +1,40 @@ + HP IPAQ H1940 + ============= + +http://www.handhelds.org/projects/h1940.html + +Introduction +------------ + + The HP H1940 is a S3C2410 based handheld device, with + bluetooth connectivity. + + +Support +------- + + A variety of information is available + + handhelds.org project page: + + http://www.handhelds.org/projects/h1940.html + + handhelds.org wiki page: + + http://handhelds.org/moin/moin.cgi/HpIpaqH1940 + + Herbert Pötzl pages: + + http://vserver.13thfloor.at/H1940/ + + +Maintainers +----------- + + This project is being maintained and developed by a variety + of people, including Ben Dooks, Arnaud Patard, and Herbert Pötzl. + + Thanks to the many others who have also provided support. + + +(c) 2005 Ben Dooks \ No newline at end of file diff --git a/Documentation/arm/Samsung-S3C24XX/Overview.txt b/Documentation/arm/Samsung-S3C24XX/Overview.txt new file mode 100644 index 000000000000..3af4d29a8938 --- /dev/null +++ b/Documentation/arm/Samsung-S3C24XX/Overview.txt @@ -0,0 +1,156 @@ + S3C24XX ARM Linux Overview + ========================== + + + +Introduction +------------ + + The Samsung S3C24XX range of ARM9 System-on-Chip CPUs are supported + by the 's3c2410' architecture of ARM Linux. Currently the S3C2410 and + the S3C2440 are supported CPUs. + + +Configuration +------------- + + A generic S3C2410 configuration is provided, and can be used as the + default by `make s3c2410_defconfig`. This configuration has support + for all the machines, and the commonly used features on them. + + Certain machines may have their own default configurations as well, + please check the machine specific documentation. 
+ + +Machines +-------- + + The currently supported machines are as follows: + + Simtec Electronics EB2410ITX (BAST) + + A general purpose development board, see EB2410ITX.txt for further + details + + Samsung SMDK2410 + + Samsung's own development board, geared for PDA work. + + Samsung/Meritech SMDK2440 + + The S3C2440 compatible version of the SMDK2440 + + Thorcom VR1000 + + Custom embedded board + + HP IPAQ 1940 + + Handheld (IPAQ), available in several varieties + + HP iPAQ rx3715 + + S3C2440 based IPAQ, with a number of variations depending on + features shipped. + + Acer N30 + + A S3C2410 based PDA from Acer. There is a Wiki page at + http://handhelds.org/moin/moin.cgi/AcerN30Documentation . + + +Adding New Machines +------------------- + + The architecture has been designed to support as many machines as can + be configured for it in one kernel build, and any future additions + should keep this in mind before altering items outside of their own + machine files. + + Machine definitions should be kept in linux/arch/arm/mach-s3c2410, + and there are a number of examples that can be looked at. + + Read the kernel patch submission policies as well as the + Documentation/arm directory before submitting patches. The + ARM kernel series is managed by Russell King, and has a patch system + located at http://www.arm.linux.org.uk/developer/patches/ + as well as mailing lists that can be found from the same site. + + As a courtesy, please notify of any new + machines or other modifications. + + Any large scale modifications, or new drivers should be discussed + on the ARM kernel mailing list (linux-arm-kernel) before being + attempted. + + +NAND +---- + + The current kernels now have support for the s3c2410 NAND + controller. 
If there are any problems the latest linux-mtd + CVS can be found from http://www.linux-mtd.infradead.org/ + + +Serial +------ + + The s3c2410 serial driver provides support for the internal + serial ports. 
These devices appear as /dev/ttySAC0 through 3. + + To create device nodes for these, use the following commands + + mknod ttySAC0 c 204 64 + mknod ttySAC1 c 204 65 + mknod ttySAC2 c 204 66 + + +GPIO +---- + + The core contains support for manipulating the GPIO, see the + documentation in GPIO.txt in the same directory as this file. + + +Clock Management +---------------- + + The core provides the interface defined in the header file + include/asm-arm/hardware/clock.h, to allow control over the + various clock units + + +Port Contributors +----------------- + + Ben Dooks (BJD) + Vincent Sanders + Herbert Potzl + Arnaud Patard (RTP) + Roc Wu + Klaus Fetscher + Dimitry Andric + Shannon Holland + Guillaume Gourat (NexVision) + Christer Weinigel (wingel) (Acer N30) + Lucas Correia Villa Real (S3C2400 port) + + +Document Changes +---------------- + + 05 Sep 2004 - BJD - Added Document Changes section + 05 Sep 2004 - BJD - Added Klaus Fetscher to list of contributors + 25 Oct 2004 - BJD - Added Dimitry Andric to list of contributors + 25 Oct 2004 - BJD - Updated the MTD from the 2.6.9 merge + 21 Jan 2005 - BJD - Added rx3715, added Shannon to contributors + 10 Feb 2005 - BJD - Added Guillaume Gourat to contributors + 02 Mar 2005 - BJD - Added SMDK2440 to list of machines + 06 Mar 2005 - BJD - Added Christer Weinigel + 08 Mar 2005 - BJD - Added LCVR to list of people, updated introduction + 08 Mar 2005 - BJD - Added section on adding machines + +Document Author +--------------- + +Ben Dooks, (c) 2004-2005 Simtec Electronics diff --git a/Documentation/arm/Samsung-S3C24XX/SMDK2440.txt b/Documentation/arm/Samsung-S3C24XX/SMDK2440.txt new file mode 100644 index 000000000000..32e1eae6a25f --- /dev/null +++ b/Documentation/arm/Samsung-S3C24XX/SMDK2440.txt @@ -0,0 +1,56 @@ + Samsung/Meritech SMDK2440 + ========================= + +Introduction +------------ + + The SMDK2440 is a two part evaluation board for the Samsung S3C2440 + processor. 
It includes support for LCD, SmartMedia, Audio, SD and + 10MBit Ethernet, and expansion headers for various signals, including + the camera and unused GPIO. + + +Configuration +------------- + + To set the default configuration, use `make smdk2440_defconfig` which + will configure the common features of this board, or use + `make s3c2410_defconfig` to include support for all s3c2410/s3c2440 machines + + +Support +------- + + Ben Dooks' SMDK2440 site at http://www.fluff.org/ben/smdk2440/ which + includes linux based USB download tools. + + Some of the h1940 patches that can be found from the H1940 project + site at http://www.handhelds.org/projects/h1940.html can also be + applied to this board. + + +Peripherals +----------- + + There is no current support for any of the extra peripherals on the + base-board itself. + + +MTD +--- + + The NAND flash should be supported by the in kernel MTD NAND support, + NOR flash will be added later. + + +Maintainers +----------- + + This board is being maintained by Ben Dooks, for more info, see + http://www.fluff.org/ben/smdk2440/ + + Many thanks to Dimitry Andric of TomTom for the loan of the SMDK2440, + and to Simtec Electronics for allowing me time to work on this. + + +(c) 2004 Ben Dooks \ No newline at end of file diff --git a/Documentation/arm/Samsung-S3C24XX/Suspend.txt b/Documentation/arm/Samsung-S3C24XX/Suspend.txt new file mode 100644 index 000000000000..e12bc3284a27 --- /dev/null +++ b/Documentation/arm/Samsung-S3C24XX/Suspend.txt @@ -0,0 +1,106 @@ + S3C24XX Suspend Support + ======================= + + +Introduction +------------ + + The S3C2410 supports a low-power suspend mode, where the SDRAM is kept + in Self-Refresh mode, and all but the essential peripheral blocks are + powered down. For more information on how this works, please look + at the S3C2410 datasheets from Samsung. 
+ + +Requirements +------------ + + 1) A bootloader that can support the necessary resume operation + + 2) Support for at least 1 source for resume + + 3) CONFIG_PM enabled in the kernel + + 4) Any peripherals that are going to be powered down at the same + time require suspend/resume support. + + +Resuming +-------- + + The S3C2410 user manual defines the process of sending the CPU to + sleep and how it resumes. The default behaviour of the Linux code + is to set the GSTATUS3 register to the physical address of the + code to resume Linux operation. + + GSTATUS4 is currently left alone by the sleep code, and is free to + use for any other purposes (for example, the EB2410ITX uses this to + save memory configuration in). + + +Machine Support +--------------- + + The machine specific functions must call the s3c2410_pm_init() function + to say that its bootloader is capable of resuming. This can be as + simple as adding the following to the machine's definition: + + INITMACHINE(s3c2410_pm_init) + + A board can do its own setup before calling s3c2410_pm_init, if it + needs to set up anything else for power management support. + + There is currently no support for over-riding the default method of + saving the resume address, if your board requires it, then contact + the maintainer and discuss what is required. + + Note, the original method of adding a late_initcall() is wrong, + and will end up initialising all compiled machines' pm init! + + +Debugging +--------- + + There are several important things to remember when using PM suspend: + + 1) The uart drivers will disable the clocks to the UART blocks when + suspending, which means that use of printascii() or similar direct + access to the UARTs will cause the debug to stop. + + 2) Whilst the pm code itself will attempt to re-enable the UART clocks, + care should be taken that any external clock sources that the UARTs + rely on are still enabled at that point. 
+ + +Configuration +------------- + + The S3C2410 specific configuration in `System Type` defines various + aspects of how the S3C2410 suspend and resume support is configured + + `S3C2410 PM Suspend debug` + + This option prints messages to the serial console before and after + the actual suspend, giving detailed information on what is + happening + + + `S3C2410 PM Suspend Memory CRC` + + Allows the entire memory to be checksummed before and after the + suspend to see if there has been any corruption of the contents. + + This support requires the CRC32 function to be enabled. + + + `S3C2410 PM Suspend CRC Chunksize (KiB)` + + Defines the size of memory each CRC chunk covers. A smaller value + will mean that the CRC data block will take more memory, but will + identify any faults with better precision + + +Document Author +--------------- + +Ben Dooks, (c) 2004 Simtec Electronics + diff --git a/Documentation/arm/Setup b/Documentation/arm/Setup new file mode 100644 index 000000000000..0abd0720d7ed --- /dev/null +++ b/Documentation/arm/Setup @@ -0,0 +1,129 @@ +Kernel initialisation parameters on ARM Linux +--------------------------------------------- + +The following document describes the kernel initialisation parameter +structure, otherwise known as 'struct param_struct' which is used +for most ARM Linux architectures. + +This structure is used to pass initialisation parameters from the +kernel loader to the Linux kernel proper, and may be short lived +through the kernel initialisation process. As a general rule, it +should not be referenced outside of arch/arm/kernel/setup.c:setup_arch(). + +There are a lot of parameters listed in there, and they are described +below: + + page_size + + This parameter must be set to the page size of the machine, and + will be checked by the kernel. + + nr_pages + + This is the total number of pages of memory in the system. If + the memory is banked, then this should contain the total number + of pages in the system. 
+ + If the system contains separate VRAM, this value should not + include this information. + + ramdisk_size + + This is now obsolete, and should not be used. + + flags + + Various kernel flags, including: + bit 0 - 1 = mount root read only + bit 1 - unused + bit 2 - 0 = load ramdisk + bit 3 - 0 = prompt for ramdisk + + rootdev + + major/minor number pair of device to mount as the root filesystem. + + video_num_cols + video_num_rows + + These two together describe the character size of the dummy console, + or VGA console character size. They should not be used for any other + purpose. + + It's generally a good idea to set these to be either standard VGA, or + the equivalent character size of your fbcon display. This then allows + all the bootup messages to be displayed correctly. + + video_x + video_y + + This describes the character position of cursor on VGA console, and + is otherwise unused. (should not be used for other console types, and + should not be used for other purposes). + + memc_control_reg + + MEMC chip control register for Acorn Archimedes and Acorn A5000 + based machines. May be used differently by different architectures. + + sounddefault + + Default sound setting on Acorn machines. May be used differently by + different architectures. + + adfsdrives + + Number of ADFS/MFM disks. May be used differently by different + architectures. + + bytes_per_char_h + bytes_per_char_v + + These are now obsolete, and should not be used. + + pages_in_bank[4] + + Number of pages in each bank of the system's memory (used for RiscPC). + This is intended to be used on systems where the physical memory + is non-contiguous from the processor's point of view. + + pages_in_vram + + Number of pages in VRAM (used on Acorn RiscPC). This value may also + be used by loaders if the size of the video RAM can't be obtained + from the hardware. + + initrd_start + initrd_size + + This describes the kernel virtual start address and size of the + initial ramdisk. 
+ + rd_start + + Start address in sectors of the ramdisk image on a floppy disk. + + system_rev + + system revision number. + + system_serial_low + system_serial_high + + system 64-bit serial number + + mem_fclk_21285 + + The speed of the external oscillator to the 21285 (footbridge), + which controls the speed of the memory bus, timer & serial port. + Depending upon the speed of the cpu its value can be between + 0-66 MHz. If no params are passed or a value of zero is passed, + then a value of 50 MHz is the default on 21285 architectures. + + paths[8][128] + + These are now obsolete, and should not be used. + + commandline + + Kernel command line parameters. Details can be found elsewhere. diff --git a/Documentation/arm/Sharp-LH/CompactFlash b/Documentation/arm/Sharp-LH/CompactFlash new file mode 100644 index 000000000000..8616d877df9e --- /dev/null +++ b/Documentation/arm/Sharp-LH/CompactFlash @@ -0,0 +1,32 @@ +README on the Compact Flash for Card Engines +============================================ + +There are three challenges in supporting the CF interface of the Card +Engines. First, every IO operation must be followed with IO to +another memory region. Second, the slot is wired for one-to-one +address mapping *and* it is wired for 16 bit access only. Third, the +interrupt request line from the CF device isn't wired. + +The IOBARRIER issue is covered in README.IOBARRIER. This isn't an +onerous problem. Enough said here. + +The addressing issue is solved in the +arch/arm/mach-lh7a40x/ide-lpd7a40x.c file with some awkward +work-arounds. We implement a special SELECT_DRIVE routine that is +called before the IDE driver performs its own SELECT_DRIVE. Our code +recognizes that the SELECT register cannot be modified without also +writing a command. It sends an IDLE_IMMEDIATE command on selecting a +drive. The function also prevents drive select to the slave drive +since there can be only one. 
The awkward part is that the IDE driver, +even though we have a select procedure, also attempts to change the +drive by writing directly to the SELECT register. This attempt is +explicitly blocked by the OUTB function--not pretty, but effective. + +The lack of interrupts is a more serious problem. Even though the CF +card is fast when compared to a normal IDE device, we don't know that +the CF is really flash. A user could use one of the very small hard +drives being shipped with a CF interface. The IDE code includes a +check for interfaces that lack an IRQ. In these cases, submitting a +command to the IDE controller is followed by a call to poll for +completion. If the device isn't immediately ready, it schedules a +timer to poll again later. diff --git a/Documentation/arm/Sharp-LH/IOBarrier b/Documentation/arm/Sharp-LH/IOBarrier new file mode 100644 index 000000000000..c0d8853672dc --- /dev/null +++ b/Documentation/arm/Sharp-LH/IOBarrier @@ -0,0 +1,45 @@ +README on the IOBARRIER for CardEngine IO +========================================= + +Due to an unfortunate oversight when the Card Engines were designed, +the signals that control access to some peripherals, most notably the +SMC91C9111 ethernet controller, are not properly handled. + +The symptom is that some back to back IO with the peripheral returns +unreliable data. With the SMC chip, you'll see errors about the bank +register being 'screwed'. + +The cause is that the AEN signal to the SMC chip does not transition +for every memory access. It is driven through the CPLD from the CS7 +line of the CPU's static memory controller which is optimized to +eliminate unnecessary transitions. Yet, the SMC requires a transition +for every write access. The Sharp website has more information about +the effect this power-conserving feature has on peripheral +interfacing. 
+ +The solution is to follow every write access to the SMC chip with an +access to another memory region that will force the CPU to release the +chip select line. It is important to guarantee that this access +forces the CPU off-chip. We map a page of SDRAM as if it were an +uncacheable IO device and read from it after every SMC IO write +operation. + + SMC IO + BARRIER IO + +Only this sequence is important. It does not matter that there is no +BARRIER IO before the access to the SMC chip because the AEN latch +only needs to occur after the SMC IO write cycle. The routines that +implement this work-around make an additional concession which is to +disable interrupts during the IO sequence. Other hardware devices +(the LogicPD CPLD) have registers in the same physical memory +region as the SMC chip. An interrupt might allow an access to one of +those registers while SMC IO is being performed. + +You might be tempted to think that we have to access another device +attached to the static memory controller, but the empirical evidence +indicates that this is not so. Mapping 0x00000000 (flash) and +0xc0000000 (SDRAM) appear to have the same effect. Using SDRAM seems +to be faster. Choosing to access an undecoded memory region is not +desirable as there is no way to know how that chip select will be used +in the future. diff --git a/Documentation/arm/Sharp-LH/KEV7A400 b/Documentation/arm/Sharp-LH/KEV7A400 new file mode 100644 index 000000000000..be32b14cd535 --- /dev/null +++ b/Documentation/arm/Sharp-LH/KEV7A400 @@ -0,0 +1,8 @@ +README on Implementing Linux for Sharp's KEV7a400 +================================================= + +This product has been discontinued by Sharp. For the time being, the +partially implemented code remains in the kernel. At some point in +the future, either the code will be finished or it will be removed +completely. This depends primarily on how many of the development +boards are in the field. 
diff --git a/Documentation/arm/Sharp-LH/LPD7A400 b/Documentation/arm/Sharp-LH/LPD7A400 new file mode 100644 index 000000000000..3275b453bfdf --- /dev/null +++ b/Documentation/arm/Sharp-LH/LPD7A400 @@ -0,0 +1,15 @@ +README on Implementing Linux for the Logic PD LPD7A400-10 +========================================================= + +- CPLD memory mapping + + The board designers chose to use high address lines for controlling + access to the CPLD registers. It turns out to be a big waste + because we're using an MMU and must map IO space into virtual + memory. The result is that we have to make a mapping for every + register. + +- Serial Console + + It may be OK not to use the serial console option if the user passes + the console device name to the kernel. This deserves some exploration. diff --git a/Documentation/arm/Sharp-LH/LPD7A40X b/Documentation/arm/Sharp-LH/LPD7A40X new file mode 100644 index 000000000000..8c29a27e208f --- /dev/null +++ b/Documentation/arm/Sharp-LH/LPD7A40X @@ -0,0 +1,16 @@ +README on Implementing Linux for the Logic PD LPD7A40X-10 +========================================================= + +- CPLD memory mapping + + The board designers chose to use high address lines for controlling + access to the CPLD registers. It turns out to be a big waste + because we're using an MMU and must map IO space into virtual + memory. The result is that we have to make a mapping for every + register. + +- Serial Console + + It may be OK not to use the serial console option if the user passes + the console device name to the kernel. This deserves some exploration. + diff --git a/Documentation/arm/Sharp-LH/SDRAM b/Documentation/arm/Sharp-LH/SDRAM new file mode 100644 index 000000000000..93ddc23c2faa --- /dev/null +++ b/Documentation/arm/Sharp-LH/SDRAM @@ -0,0 +1,51 @@ +README on the SDRAM Controller for the LH7a40X +============================================== + +The standard configuration for the SDRAM controller generates a sparse +memory array. 
The precise layout is determined by the SDRAM chips. A +default kernel configuration assembles the discontiguous memory +regions into separate memory nodes via the NUMA (Non-Uniform Memory +Architecture) facilities. In this default configuration, the kernel +is forgiving about the precise layout. As long as it is given an +accurate picture of available memory by the bootloader the kernel will +execute correctly. + +The SDRC supports a mode where some of the chip select lines are +swapped in order to make SDRAM look like a synchronous ROM. Setting +this bit means that the RAM will present as a contiguous array. Some +programmers prefer this to the discontiguous layout. Be aware that +there may be a penalty for this feature where some configurations of +memory are significantly reduced; i.e. 64MiB of RAM appears as only 32 +MiB. + +There are a couple of configuration options to override the default +behavior. When the SROMLL bit is set and memory appears as a +contiguous array, there is no reason to support NUMA. +CONFIG_LH7A40X_CONTIGMEM disables NUMA support. When physical memory +is discontiguous, the memory tables are organized such that there are +two banks per node with a small gap between them. This layout wastes +some kernel memory for page tables representing non-existent memory. +CONFIG_LH7A40X_ONE_BANK_PER_NODE optimizes the node tables such that +there are no gaps. These options control the low level organization +of the memory management tables in ways that may prevent the kernel +from booting or may cause the kernel to allocate excessively large +page tables. Be warned. Only change these options if you know what +you are doing. The default behavior is a reasonable compromise that +will suit all users. + +-- + +A typical 32MiB system with the default configuration options will +find physical memory managed as follows. 
+ + node 0: 0xc0000000 4MiB + 0xc1000000 4MiB + node 1: 0xc4000000 4MiB + 0xc5000000 4MiB + node 2: 0xc8000000 4MiB + 0xc9000000 4MiB + node 3: 0xcc000000 4MiB + 0xcd000000 4MiB + +Setting CONFIG_LH7A40X_ONE_BANK_PER_NODE will put each bank into a +separate node. diff --git a/Documentation/arm/Sharp-LH/VectoredInterruptController b/Documentation/arm/Sharp-LH/VectoredInterruptController new file mode 100644 index 000000000000..23047e9861ee --- /dev/null +++ b/Documentation/arm/Sharp-LH/VectoredInterruptController @@ -0,0 +1,80 @@ +README on the Vectored Interrupt Controller of the LH7A404 +========================================================== + +The 404 revision of the LH7A40X series comes with two vectored +interrupt controllers. While the kernel does use some of the +features of these devices, it is far from the purpose for which they +were designed. + +When this README was written, the implementation of the VICs was in +flux. It is possible that some details, especially with priorities, +will change. + +The VIC support code is inspired by routines written by Sharp. + + +Priority Control +---------------- + +The significant reason for using the VIC's vectoring is to control +interrupt priorities. There are two tables in +arch/arm/mach-lh7a40x/irq-lh7a404.c that look something like this. + + static unsigned char irq_pri_vic1[] = { IRQ_GPIO3INTR, }; + static unsigned char irq_pri_vic2[] = { + IRQ_T3UI, IRQ_GPIO7INTR, + IRQ_UART1INTR, IRQ_UART2INTR, IRQ_UART3INTR, }; + +The initialization code reads these tables and inserts a vector +address and enable for each indicated IRQ. Vectored interrupts have +higher priority than non-vectored interrupts. So, on VIC1, +IRQ_GPIO3INTR will be served before any other non-FIQ interrupt. Due +to the way that the vectoring works, IRQ_T3UI is the next highest +priority followed by the other vectored interrupts on VIC2. After +that, the non-vectored interrupts are scanned in VIC1 then in VIC2. 
+ + +ISR +--- + +The interrupt service routine macro get_irqnr() in +arch/arm/kernel/entry-armv.S scans the VICs for the next active +interrupt. The vectoring makes this code somewhat larger than it was +before using vectoring (refer to the LH7A400 implementation). In the +case where an interrupt is vectored, the implementation will tend to +be faster than the non-vectored version. However, the worst-case path +is longer. + +It is worth noting that at present, there is no need to read +VIC2_VECTADDR because the register appears to be shared between the +controllers. The code is written such that if this changes, it ought +to still work properly. + + +Vector Addresses +---------------- + +The proper use of the vectoring hardware would jump to the ISR +specified by the vectoring address. Linux isn't structured to take +advantage of this feature, though it might be possible to change +things to support it. + +In this implementation, the vectoring address is used to speed the +search for the active IRQ. The address is coded such that the lowest +6 bits store the IRQ number for vectored interrupts. These numbers +correspond to the bits in the interrupt status registers. IRQ zero is +the lowest interrupt bit in VIC1. IRQ 32 is the lowest interrupt bit +in VIC2. Because zero is a valid IRQ number and because we cannot +detect whether or not there is a valid vectoring address if that +address is zero, the eighth bit (0x100) is set for vectored interrupts. +The address for IRQ 0x18 (VIC2) is 0x118. Only the ninth bit is set +for the default handler on VIC1 and only the tenth bit is set for the +default handler on VIC2. + +In other words. 
+ + 0x000 - no active interrupt + 0x1ii - vectored interrupt 0xii + 0x2xx - unvectored interrupt on VIC1 (xx is don't care) + 0x4xx - unvectored interrupt on VIC2 (xx is don't care) + diff --git a/Documentation/arm/VFP/release-notes.txt b/Documentation/arm/VFP/release-notes.txt new file mode 100644 index 000000000000..f28e0222f5e5 --- /dev/null +++ b/Documentation/arm/VFP/release-notes.txt @@ -0,0 +1,55 @@ +Release notes for Linux Kernel VFP support code +----------------------------------------------- + +Date: 20 May 2004 +Author: Russell King + +This is the first release of the Linux Kernel VFP support code. It +provides support for the exceptions bounced from VFP hardware found +on ARM926EJ-S. + +This release has been validated against the SoftFloat-2b library by +John R. Hauser using the TestFloat-2a test suite. Details of this +library and test suite can be found at: + + http://www.cs.berkeley.edu/~jhauser/arithmetic/SoftFloat.html + +The operations which have been tested with this package are: + + - fdiv + - fsub + - fadd + - fmul + - fcmp + - fcmpe + - fcvtd + - fcvts + - fsito + - ftosi + - fsqrt + +All the above pass softfloat tests with the following exceptions: + +- fadd/fsub shows some differences in the handling of +0 / -0 results + when input operands differ in signs. +- the handling of underflow exceptions is slightly different. If a + result underflows before rounding, but becomes a normalised number + after rounding, we do not signal an underflow exception. 
+ +Other operations which have been tested by basic assembly-only tests +are: + + - fcpy + - fabs + - fneg + - ftoui + - ftosiz + - ftouiz + +The combination operations have not been tested: + + - fmac + - fnmac + - fmsc + - fnmsc + - fnmul diff --git a/Documentation/arm/empeg/README b/Documentation/arm/empeg/README new file mode 100644 index 000000000000..09cc8d03ae58 --- /dev/null +++ b/Documentation/arm/empeg/README @@ -0,0 +1,13 @@ +Empeg, Ltd's Empeg MP3 Car Audio Player + +The initial design is to go in your car, but you can use it at home, on a +boat... almost anywhere. The principle is to store CD-quality music using +MPEG technology onto a hard disk in the unit, and use the power of the +embedded computer to serve up the music you want. + +For more details, see: + + http://www.empeg.com + + + diff --git a/Documentation/arm/empeg/ir.txt b/Documentation/arm/empeg/ir.txt new file mode 100644 index 000000000000..10a297450164 --- /dev/null +++ b/Documentation/arm/empeg/ir.txt @@ -0,0 +1,49 @@ +Infra-red driver documentation. + +Mike Crowe +(C) Empeg Ltd 1999 + +Not a lot here yet :-) + +The Kenwood KCA-R6A remote control generates a sequence like the following: + +Go low for approx 16T (Around 9000us) +Go high for approx 8T (Around 4000us) +Go low for less than 2T (Around 750us) + +For each of the 32 bits + Go high for more than 2T (Around 1500us) == 1 + Go high for less than T (Around 400us) == 0 + Go low for less than 2T (Around 750us) + +Rather than repeat a signal when the button is held down certain buttons +generate the following code to indicate repetition. + +Go low for approx 16T +Go high for approx 4T +Go low for less than 2T + +(By removing the <2T from the start of the sequence and placing at the end + it can be considered a stop bit but I found it easier to deal with it at + the start). + +The 32 bits are encoded as XxYy where x and y are the actual data values +while X and Y are the logical inverses of the associated data values. 
Using +LSB first yields sensible codes for the numbers. + +All codes are of the form b9xx + +The numeric keys generate the code 0x where x is the number pressed. + +Tuner 1c +Tape 1d +CD 1e +CD-MD-CH 1f +Track- 0a +Track+ 0b +Rewind 0c +FF 0d +DNPP 5e +Play/Pause 0e +Vol+ 14 +Vol- 15 diff --git a/Documentation/arm/empeg/mkdevs b/Documentation/arm/empeg/mkdevs new file mode 100644 index 000000000000..7a85e28d14f3 --- /dev/null +++ b/Documentation/arm/empeg/mkdevs @@ -0,0 +1,11 @@ +#!/bin/sh +mknod /dev/display c 244 0 +mknod /dev/ir c 242 0 +mknod /dev/usb0 c 243 0 +mknod /dev/audio c 245 4 +mknod /dev/dsp c 245 3 +mknod /dev/mixer c 245 0 +mknod /dev/empeg_state c 246 0 +mknod /dev/radio0 c 81 64 +ln -sf radio0 radio +ln -sf usb0 usb diff --git a/Documentation/arm/mem_alignment b/Documentation/arm/mem_alignment new file mode 100644 index 000000000000..d145ccca169a --- /dev/null +++ b/Documentation/arm/mem_alignment @@ -0,0 +1,58 @@ +Too many problems popped up because of unnoticed misaligned memory access in +kernel code lately. Therefore the alignment fixup is now unconditionally +configured in for SA11x0 based targets. According to Alan Cox, this is a +bad idea to configure it out, but Russell King has some good reasons for +doing so on some f***ed up ARM architectures like the EBSA110. However +this is not the case on many designs I'm aware of, like all SA11x0 based +ones. + +Of course this is a bad idea to rely on the alignment trap to perform +unaligned memory access in general. If those accesses are predictable, you +are better to use the macros provided by include/asm/unaligned.h. The +alignment trap can fixup misaligned access for the exception cases, but at +a high performance cost. It better be rare. 
+ +Now for user space applications, it is possible to configure the alignment +trap to SIGBUS any code performing unaligned access (good for debugging bad +code), or even fixup the access by software like for kernel code. 
The latter +mode isn't recommended for performance reasons (just think about the +floating point emulation that works about the same way). Fix your code +instead! + +Please note that randomly changing the behaviour without good thought is +real bad - it changes the behaviour of all unaligned instructions in user +space, and might cause programs to fail unexpectedly. + +To change the alignment trap behavior, simply echo a number into +/proc/sys/debug/alignment. The number is made up from various bits: + +bit behavior when set +--- ----------------- + +0 A user process performing an unaligned memory access + will cause the kernel to print a message indicating + process name, pid, pc, instruction, address, and the + fault code. + +1 The kernel will attempt to fix up the user process + performing the unaligned access. This is of course + slow (think about the floating point emulator) and + not recommended for production use. + +2 The kernel will send a SIGBUS signal to the user process + performing the unaligned access. + +Note that not all combinations are supported - only values 0 through 5. +(6 and 7 don't make sense). + +For example, the following will turn on the warnings, but without +fixing up or sending SIGBUS signals: + + echo 1 > /proc/sys/debug/alignment + +You can also read the content of the same file to get statistical +information on unaligned access occurrences plus the current mode of +operation for user space code. + + +Nicolas Pitre, Mar 13, 2001. Modified Russell King, Nov 30, 2001. diff --git a/Documentation/arm/memory.txt b/Documentation/arm/memory.txt new file mode 100644 index 000000000000..4b1c93a8177b --- /dev/null +++ b/Documentation/arm/memory.txt @@ -0,0 +1,72 @@ + Kernel Memory Layout on ARM Linux + + Russell King + May 21, 2004 (2.6.6) + +This document describes the virtual memory layout which the Linux +kernel uses for ARM processors. It indicates which regions are +free for platforms to use, and which are used by generic code.
+ +The ARM CPU is capable of addressing a maximum of 4GB virtual memory +space, and this must be shared between user space processes, the +kernel, and hardware devices. + +As the ARM architecture matures, it becomes necessary to reserve +certain regions of VM space for use for new facilities; therefore +this document may reserve more VM space over time. + +Start End Use +-------------------------------------------------------------------------- +ffff8000 ffffffff copy_user_page / clear_user_page use. + For SA11xx and Xscale, this is used to + setup a minicache mapping. + +ffff1000 ffff7fff Reserved. + Platforms must not use this address range. + +ffff0000 ffff0fff CPU vector page. + The CPU vectors are mapped here if the + CPU supports vector relocation (control + register V bit.) + +ffc00000 fffeffff DMA memory mapping region. Memory returned + by the dma_alloc_xxx functions will be + dynamically mapped here. + +ff000000 ffbfffff Reserved for future expansion of DMA + mapping region. + +VMALLOC_END feffffff Free for platform use, recommended. + +VMALLOC_START VMALLOC_END-1 vmalloc() / ioremap() space. + Memory returned by vmalloc/ioremap will + be dynamically placed in this region. + VMALLOC_START may be based upon the value + of the high_memory variable. + +PAGE_OFFSET high_memory-1 Kernel direct-mapped RAM region. + This maps the platforms RAM, and typically + maps all platform RAM in a 1:1 relationship. + +TASK_SIZE PAGE_OFFSET-1 Kernel module space + Kernel modules inserted via insmod are + placed here using dynamic mappings. + +00001000 TASK_SIZE-1 User space mappings + Per-thread mappings are placed here via + the mmap() system call. + +00000000 00000fff CPU vector page / null pointer trap + CPUs which do not support vector remapping + place their vector page here. NULL pointer + dereferences by both the kernel and user + space are also caught via this mapping. 
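As a rough illustration of how the table above might be consulted, the sketch below encodes only the rows with fixed numeric bounds and looks up which one an address falls in. The names vm_region, fixed_regions and classify_fixed_region are hypothetical and do not exist in the kernel; the rows bounded by VMALLOC_START, VMALLOC_END, PAGE_OFFSET, TASK_SIZE and high_memory are platform-dependent and are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: one row of the layout table. */
struct vm_region {
	uint32_t start;		/* first address of the region */
	uint32_t end;		/* last address of the region (inclusive) */
	const char *use;
};

/* Only the rows with fixed numeric bounds, copied from the table. */
static const struct vm_region fixed_regions[] = {
	{ 0xffff8000u, 0xffffffffu, "copy_user_page / clear_user_page" },
	{ 0xffff1000u, 0xffff7fffu, "reserved" },
	{ 0xffff0000u, 0xffff0fffu, "CPU vector page" },
	{ 0xffc00000u, 0xfffeffffu, "DMA memory mapping region" },
	{ 0xff000000u, 0xffbfffffu, "reserved for DMA expansion" },
	{ 0x00000000u, 0x00000fffu, "CPU vector page / null pointer trap" },
};

/* Returns the use of the fixed region containing addr, or NULL when the
 * address lies in one of the platform-dependent (symbolic) regions. */
static const char *classify_fixed_region(uint32_t addr)
{
	size_t i;

	for (i = 0; i < sizeof(fixed_regions) / sizeof(fixed_regions[0]); i++) {
		if (addr >= fixed_regions[i].start &&
		    addr <= fixed_regions[i].end)
			return fixed_regions[i].use;
	}
	return NULL;
}
```

A platform port could use a check of this kind as a sanity test that none of its static mappings overlap the reserved ranges; a NULL result only means the address is outside the fixed rows, not that it is free to use.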
+ +Please note that mappings which collide with the above areas may result +in a non-bootable kernel, or may cause the kernel to (eventually) panic +at run time. + +Since future CPUs may impact the kernel mapping layout, user programs +must not access any memory which is not mapped inside their 0x0001000 +to TASK_SIZE address range. If they wish to access these areas, they +must set up their own mappings using open() and mmap(). diff --git a/Documentation/arm/nwfpe/NOTES b/Documentation/arm/nwfpe/NOTES new file mode 100644 index 000000000000..40577b5a49d3 --- /dev/null +++ b/Documentation/arm/nwfpe/NOTES @@ -0,0 +1,29 @@ +There seems to be a problem with exp(double) and our emulator. I haven't +been able to track it down yet. This does not occur with the emulator +supplied by Russell King. + +I also found one oddity in the emulator. I don't think it is serious but +will point it out. The ARM calling conventions require floating point +registers f4-f7 to be preserved over a function call. The compiler quite +often uses an stfe instruction to save f4 on the stack upon entry to a +function, and an ldfe instruction to restore it before returning. + +I was looking at some code, that calculated a double result, stored it in f4 +then made a function call. Upon return from the function call the number in +f4 had been converted to an extended value in the emulator. + +This is a side effect of the stfe instruction. The double in f4 had to be +converted to extended, then stored. If an lfm/sfm combination had been used, +then no conversion would occur. This has performance considerations. The +result from the function call and f4 were used in a multiplication. If the +emulator sees a multiply of a double and extended, it promotes the double to +extended, then does the multiply in extended precision. 
+ +This code will cause this problem: + +double x, y, z; +z = log(x)/log(y); + +The result of log(x) (a double) will be calculated, returned in f0, then +moved to f4 to preserve it over the log(y) call. The division will be done +in extended precision, due to the stfe instruction used to save f4 in log(y). diff --git a/Documentation/arm/nwfpe/README b/Documentation/arm/nwfpe/README new file mode 100644 index 000000000000..771871de0c8b --- /dev/null +++ b/Documentation/arm/nwfpe/README @@ -0,0 +1,70 @@ +This directory contains the version 0.92 test release of the NetWinder +Floating Point Emulator. + +The majority of the code was written by me, Scott Bambrough It is +written in C, with a small number of routines in inline assembler +where required. It was written quickly, with a goal of implementing a +working version of all the floating point instructions the compiler +emits as the first target. I have attempted to be as optimal as +possible, but there remains much room for improvement. + +I have attempted to make the emulator as portable as possible. One of +the problems is with leading underscores on kernel symbols. Elf +kernels have no leading underscores, a.out compiled kernels do. I +have attempted to use the C_SYMBOL_NAME macro wherever this may be +important. + +Another choice I made was in the file structure. I have attempted to +contain all operating system specific code in one module (fpmodule.*). +All the other files contain emulator specific code. This should allow +others to port the emulator to NetBSD for instance relatively easily. + +The floating point operations are based on SoftFloat Release 2, by +John Hauser. SoftFloat is a software implementation of floating-point +that conforms to the IEC/IEEE Standard for Binary Floating-point +Arithmetic. As many as four formats are supported: single precision, +double precision, extended double precision, and quadruple precision. 
+All operations required by the standard are implemented, except for +conversions to and from decimal. We use only the single precision, +double precision and extended double precision formats. The port of +SoftFloat to the ARM was done by Phil Blundell, based on an earlier +port of SoftFloat version 1 by Neil Carson for NetBSD/arm32. + +The file README.FPE contains a description of what has been implemented +so far in the emulator. The file TODO contains information on what +remains to be done, and other ideas for the emulator. + +Bug reports, comments, suggestions should be directed to me at +. General reports of "this program doesn't +work correctly when your emulator is installed" are useful for +determining that bugs still exist, but are virtually useless when +attempting to isolate the problem. Please report them, but don't +expect quick action. Bugs still exist. The problem remains in isolating +which instruction contains the bug. Small programs illustrating a specific +problem are a godsend. + +Legal Notices +------------- + +The NetWinder Floating Point Emulator is free software. Everything Rebel.com +has written is provided under the GNU GPL. See the file COPYING for copying +conditions. Excluded from the above is the SoftFloat code. John Hauser's +legal notice for SoftFloat is included below. + +------------------------------------------------------------------------------- +SoftFloat Legal Notice + +SoftFloat was written by John R. Hauser. This work was made possible in +part by the International Computer Science Institute, located at Suite 600, +1947 Center Street, Berkeley, California 94704. Funding was partially +provided by the National Science Foundation under grant MIP-9311980. The +original version of this code was written as part of a project to build +a fixed-point vector processor in collaboration with the University of +California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.
+ +THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort +has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT +TIMES RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO +PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY +AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE. +------------------------------------------------------------------------------- diff --git a/Documentation/arm/nwfpe/README.FPE b/Documentation/arm/nwfpe/README.FPE new file mode 100644 index 000000000000..26f5d7bb9a41 --- /dev/null +++ b/Documentation/arm/nwfpe/README.FPE @@ -0,0 +1,156 @@ +The following describes the current state of the NetWinder's floating point +emulator. + +In the following nomenclature is used to describe the floating point +instructions. It follows the conventions in the ARM manual. + + = , no default +{P|M|Z} = {round to +infinity,round to -infinity,round to zero}, + default = round to nearest + +Note: items enclosed in {} are optional. + +Floating Point Coprocessor Data Transfer Instructions (CPDT) +------------------------------------------------------------ + +LDF/STF - load and store floating + +{cond} Fd, Rn +{cond} Fd, [Rn, #]{!} +{cond} Fd, [Rn], # + +These instructions are fully implemented. + +LFM/SFM - load and store multiple floating + +Form 1 syntax: +{cond} Fd, , [Rn] +{cond} Fd, , [Rn, #]{!} +{cond} Fd, , [Rn], # + +Form 2 syntax: +{cond} Fd, , [Rn]{!} + +These instructions are fully implemented. They store/load three words +for each floating point register into the memory location given in the +instruction. The format in memory is unlikely to be compatible with +other implementations, in particular the actual hardware. Specific +mention of this is made in the ARM manuals. 
+ +Floating Point Coprocessor Register Transfer Instructions (CPRT) +---------------------------------------------------------------- + +Conversions, read/write status/control register instructions + +FLT{cond}{P,M,Z} Fn, Rd Convert integer to floating point +FIX{cond}{P,M,Z} Rd, Fn Convert floating point to integer +WFS{cond} Rd Write floating point status register +RFS{cond} Rd Read floating point status register +WFC{cond} Rd Write floating point control register +RFC{cond} Rd Read floating point control register + +FLT/FIX are fully implemented. + +RFS/WFS are fully implemented. + +RFC/WFC are fully implemented. RFC/WFC are supervisor only instructions, and +presently check the CPU mode, and do an invalid instruction trap if not called +from supervisor mode. + +Compare instructions + +CMF{cond} Fn, Fm Compare floating +CMFE{cond} Fn, Fm Compare floating with exception +CNF{cond} Fn, Fm Compare negated floating +CNFE{cond} Fn, Fm Compare negated floating with exception + +These are fully implemented. + +Floating Point Coprocessor Data Instructions (CPDT) +--------------------------------------------------- + +Dyadic operations: + +ADF{cond}{P,M,Z} Fd, Fn, - add +SUF{cond}{P,M,Z} Fd, Fn, - subtract +RSF{cond}{P,M,Z} Fd, Fn, - reverse subtract +MUF{cond}{P,M,Z} Fd, Fn, - multiply +DVF{cond}{P,M,Z} Fd, Fn, - divide +RDV{cond}{P,M,Z} Fd, Fn, - reverse divide + +These are fully implemented. + +FML{cond}{P,M,Z} Fd, Fn, - fast multiply +FDV{cond}{P,M,Z} Fd, Fn, - fast divide +FRD{cond}{P,M,Z} Fd, Fn, - fast reverse divide + +These are fully implemented as well. They use the same algorithm as the +non-fast versions. Hence, in this implementation their performance is +equivalent to the MUF/DVF/RDV instructions. This is acceptable according +to the ARM manual. The manual notes these are defined only for single +operands, on the actual FPA11 hardware they do not work for double or +extended precision operands. 
The emulator currently does not check +the requested permissions conditions, and performs the requested operation. + +RMF{cond}{P,M,Z} Fd, Fn, - IEEE remainder + +This is fully implemented. + +Monadic operations: + +MVF{cond}{P,M,Z} Fd, - move +MNF{cond}{P,M,Z} Fd, - move negated + +These are fully implemented. + +ABS{cond}{P,M,Z} Fd, - absolute value +SQT{cond}{P,M,Z} Fd, - square root +RND{cond}{P,M,Z} Fd, - round + +These are fully implemented. + +URD{cond}{P,M,Z} Fd, - unnormalized round +NRM{cond}{P,M,Z} Fd, - normalize + +These are implemented. URD is implemented using the same code as the RND +instruction. Since URD cannot return a unnormalized number, NRM becomes +a NOP. + +Library calls: + +POW{cond}{P,M,Z} Fd, Fn, - power +RPW{cond}{P,M,Z} Fd, Fn, - reverse power +POL{cond}{P,M,Z} Fd, Fn, - polar angle (arctan2) + +LOG{cond}{P,M,Z} Fd, - logarithm to base 10 +LGN{cond}{P,M,Z} Fd, - logarithm to base e +EXP{cond}{P,M,Z} Fd, - exponent +SIN{cond}{P,M,Z} Fd, - sine +COS{cond}{P,M,Z} Fd, - cosine +TAN{cond}{P,M,Z} Fd, - tangent +ASN{cond}{P,M,Z} Fd, - arcsine +ACS{cond}{P,M,Z} Fd, - arccosine +ATN{cond}{P,M,Z} Fd, - arctangent + +These are not implemented. They are not currently issued by the compiler, +and are handled by routines in libc. These are not implemented by the FPA11 +hardware, but are handled by the floating point support code. They should +be implemented in future versions. + +Signalling: + +Signals are implemented. However current ELF kernels produced by Rebel.com +have a bug in them that prevents the module from generating a SIGFPE. This +is caused by a failure to alias fp_current to the kernel variable +current_set[0] correctly. + +The kernel provided with this distribution (vmlinux-nwfpe-0.93) contains +a fix for this problem and also incorporates the current version of the +emulator directly. It is possible to run with no floating point module +loaded with this kernel. 
It is provided as a demonstration of the +technology and for those who want to do floating point work that depends +on signals. It is not strictly necessary to use the module. + +A module (either the one provided by Russell King, or the one in this +distribution) can be loaded to replace the functionality of the emulator +built into the kernel. diff --git a/Documentation/arm/nwfpe/TODO b/Documentation/arm/nwfpe/TODO new file mode 100644 index 000000000000..8027061b60eb --- /dev/null +++ b/Documentation/arm/nwfpe/TODO @@ -0,0 +1,67 @@ +TODO LIST +--------- + +POW{cond}{P,M,Z} Fd, Fn, - power +RPW{cond}{P,M,Z} Fd, Fn, - reverse power +POL{cond}{P,M,Z} Fd, Fn, - polar angle (arctan2) + +LOG{cond}{P,M,Z} Fd, - logarithm to base 10 +LGN{cond}{P,M,Z} Fd, - logarithm to base e +EXP{cond}{P,M,Z} Fd, - exponent +SIN{cond}{P,M,Z} Fd, - sine +COS{cond}{P,M,Z} Fd, - cosine +TAN{cond}{P,M,Z} Fd, - tangent +ASN{cond}{P,M,Z} Fd, - arcsine +ACS{cond}{P,M,Z} Fd, - arccosine +ATN{cond}{P,M,Z} Fd, - arctangent + +These are not implemented. They are not currently issued by the compiler, +and are handled by routines in libc. These are not implemented by the FPA11 +hardware, but are handled by the floating point support code. They should +be implemented in future versions. + +There are a couple of ways to approach the implementation of these. One +method would be to use accurate table methods for these routines. I have +a couple of papers by S. Gal from IBM's research labs in Haifa, Israel that +seem to promise extreme accuracy (in the order of 99.8%) and reasonable speed. +These methods are used in GLIBC for some of the transcendental functions. + +Another approach, which I know little about is CORDIC. This stands for +Coordinate Rotation Digital Computer, and is a method of computing +transcendental functions using mostly shifts and adds and a few +multiplications and divisions. 
The ARM excels at shifts and adds, +so such a method could be promising, but requires more research to +determine if it is feasible. + +Rounding Methods + +The IEEE standard defines 4 rounding modes. Round to nearest is the +default, but rounding to + or - infinity or round to zero are also allowed. +Many architectures allow the rounding mode to be specified by modifying bits +in a control register. Not so with the ARM FPA11 architecture. To change +the rounding mode one must specify it with each instruction. + +This has made porting some benchmarks difficult. It is possible to +introduce such a capability into the emulator. The FPCR contains +bits describing the rounding mode. The emulator could be altered to +examine a flag, which if set forced it to ignore the rounding mode in +the instruction, and use the mode specified in the bits in the FPCR. + +This would require a method of getting/setting the flag, and the bits +in the FPCR. This requires a kernel call in ArmLinux, as WFC/RFC are +supervisor only instructions. If anyone has any ideas or comments I +would like to hear them. + +[NOTE: pulled out from some docs on ARM floating point, specifically + for the Acorn FPE, but not limited to it: + + The floating point control register (FPCR) may only be present in some + implementations: it is there to control the hardware in an implementation- + specific manner, for example to disable the floating point system. The user + mode of the ARM is not permitted to use this register (since the right is + reserved to alter it between implementations) and the WFC and RFC + instructions will trap if tried in user mode. + + Hence, the answer is yes, you could do this, but then you will run a high + risk of becoming isolated if and when hardware FP emulation comes out + -- Russell]. 
diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt new file mode 100644 index 000000000000..8eedaa24f5e2 --- /dev/null +++ b/Documentation/atomic_ops.txt @@ -0,0 +1,456 @@ + Semantics and Behavior of Atomic and + Bitmask Operations + + David S. Miller + + This document is intended to serve as a guide to Linux port +maintainers on how to implement atomic counter, bitops, and spinlock +interfaces properly. + + The atomic_t type should be defined as a signed integer. +Also, it should be made opaque such that any kind of cast to a normal +C integer type will fail. Something like the following should +suffice: + + typedef struct { volatile int counter; } atomic_t; + + The first operations to implement for atomic_t's are the +initializers and plain reads. + + #define ATOMIC_INIT(i) { (i) } + #define atomic_set(v, i) ((v)->counter = (i)) + +The first macro is used in definitions, such as: + +static atomic_t my_counter = ATOMIC_INIT(1); + +The second interface can be used at runtime, as in: + + struct foo { atomic_t counter; }; + ... + + struct foo *k; + + k = kmalloc(sizeof(*k), GFP_KERNEL); + if (!k) + return -ENOMEM; + atomic_set(&k->counter, 0); + +Next, we have: + + #define atomic_read(v) ((v)->counter) + +which simply reads the current value of the counter. + +Now, we move on to the actual atomic operation interfaces. + + void atomic_add(int i, atomic_t *v); + void atomic_sub(int i, atomic_t *v); + void atomic_inc(atomic_t *v); + void atomic_dec(atomic_t *v); + +These four routines add and subtract integral values to/from the given +atomic_t value. The first two routines pass explicit integers by +which to make the adjustment, whereas the latter two use an implicit +adjustment value of "1". + +One very important aspect of these four routines is that they DO NOT +require any explicit memory barriers. They need only perform the +atomic_t counter update in an SMP safe manner.
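The "atomic but unordered" contract of these routines can be pictured with a userspace analogy. The sketch below is not how any architecture port implements atomic_t; it assumes a C11 compiler with <stdatomic.h>, and every sketch_ name is hypothetical. The relaxed memory order stands in for "SMP safe update, no barrier implied".

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for atomic_t, wrapped in a struct so
 * that it cannot be cast to or mixed with a plain int by accident. */
typedef struct { atomic_int counter; } sketch_atomic_t;

#define SKETCH_ATOMIC_INIT(i)	{ (i) }

static inline void sketch_atomic_set(sketch_atomic_t *v, int i)
{
	atomic_store_explicit(&v->counter, i, memory_order_relaxed);
}

static inline int sketch_atomic_read(sketch_atomic_t *v)
{
	return atomic_load_explicit(&v->counter, memory_order_relaxed);
}

/* The update itself is atomic, but memory_order_relaxed imposes no
 * ordering on surrounding loads and stores. */
static inline void sketch_atomic_add(int i, sketch_atomic_t *v)
{
	atomic_fetch_add_explicit(&v->counter, i, memory_order_relaxed);
}

static inline void sketch_atomic_sub(int i, sketch_atomic_t *v)
{
	atomic_fetch_sub_explicit(&v->counter, i, memory_order_relaxed);
}

static inline void sketch_atomic_inc(sketch_atomic_t *v)
{
	sketch_atomic_add(1, v);
}

static inline void sketch_atomic_dec(sketch_atomic_t *v)
{
	sketch_atomic_sub(1, v);
}
```

A counter declared as "static sketch_atomic_t hits = SKETCH_ATOMIC_INIT(0);" can then be bumped from several threads with sketch_atomic_inc(&hits) without losing updates, while nothing is promised about the ordering of neighbouring memory operations.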
+ +Next, we have: + + int atomic_inc_return(atomic_t *v); + int atomic_dec_return(atomic_t *v); + +These routines add 1 and subtract 1, respectively, from the given +atomic_t and return the new counter value after the operation is +performed. + +Unlike the above routines, it is required that explicit memory +barriers are performed before and after the operation. It must be +done such that all memory operations before and after the atomic +operation calls are strongly ordered with respect to the atomic +operation itself. + +For example, it should behave as if an smp_mb() call existed both +before and after the atomic operation. + +If the atomic instructions used in an implementation provide explicit +memory barrier semantics which satisfy the above requirements, that is +fine as well. + +Let's move on: + + int atomic_add_return(int i, atomic_t *v); + int atomic_sub_return(int i, atomic_t *v); + +These behave just like atomic_{inc,dec}_return() except that an +explicit counter adjustment is given instead of the implicit "1". +This means that like atomic_{inc,dec}_return(), the memory barrier +semantics are required. + +Next: + + int atomic_inc_and_test(atomic_t *v); + int atomic_dec_and_test(atomic_t *v); + +These two routines increment and decrement by 1, respectively, the +given atomic counter. They return a boolean indicating whether the +resulting counter value was zero or not. + +They require explicit memory barrier semantics around the operation as +above. + + int atomic_sub_and_test(int i, atomic_t *v); + +This is identical to atomic_dec_and_test() except that an explicit +decrement is given instead of the implicit "1". It requires explicit +memory barrier semantics around the operation. + + int atomic_add_negative(int i, atomic_t *v); + +The given increment is added to the given atomic counter value. A +boolean is returned which indicates whether the resulting counter value +is negative. It requires explicit memory barrier semantics around the +operation.
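The "as if smp_mb() existed before and after" requirement on the value-returning routines above can be sketched in userspace C11. This is an analogy under stated assumptions, not a port's implementation: atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb(), and all sketch_ names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical add-and-return with a full fence on each side of an
 * otherwise unordered atomic update. Returns the new counter value. */
static inline int sketch_add_return(int i, atomic_int *v)
{
	int old;

	atomic_thread_fence(memory_order_seq_cst);	/* "smp_mb()" before */
	old = atomic_fetch_add_explicit(v, i, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* "smp_mb()" after */
	return old + i;
}

/* True when the counter reaches zero after the decrement. */
static inline bool sketch_dec_and_test(atomic_int *v)
{
	return sketch_add_return(-1, v) == 0;
}

/* True when the counter reaches zero after subtracting i. */
static inline bool sketch_sub_and_test(int i, atomic_int *v)
{
	return sketch_add_return(-i, v) == 0;
}

/* True when the counter is negative after adding i. */
static inline bool sketch_add_negative(int i, atomic_int *v)
{
	return sketch_add_return(i, v) < 0;
}
```

On hardware whose atomic instructions already act as full barriers the two fences would be redundant, which is the case the text above calls "fine as well".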
+ +If a caller requires memory barrier semantics around an atomic_t +operation which does not return a value, a set of interfaces is +defined which accomplish this: + + void smp_mb__before_atomic_dec(void); + void smp_mb__after_atomic_dec(void); + void smp_mb__before_atomic_inc(void); + void smp_mb__after_atomic_inc(void); + +For example, smp_mb__before_atomic_dec() can be used like so: + + obj->dead = 1; + smp_mb__before_atomic_dec(); + atomic_dec(&obj->ref_count); + +It makes sure that all memory operations preceding the atomic_dec() +call are strongly ordered with respect to the atomic counter +operation. In the above example, it guarantees that the assignment of +"1" to obj->dead will be globally visible to other cpus before the +atomic counter decrement. + +Without the explicit smp_mb__before_atomic_dec() call, the +implementation could legally allow the atomic counter update to become visible +to other cpus before the "obj->dead = 1;" assignment. + +The other three interfaces listed are used to provide explicit +ordering with respect to memory operations after an atomic_dec() call +(smp_mb__after_atomic_dec()) and around atomic_inc() calls +(smp_mb__{before,after}_atomic_inc()). + +A missing memory barrier in the cases where they are required by the +atomic_t implementation above can have disastrous results. Here is +an example, which follows a pattern occurring frequently in the Linux +kernel.
It is the use of atomic counters to implement reference +counting, and it works such that once the counter falls to zero it can +be guaranteed that no other entity can be accessing the object: + +static void obj_list_add(struct obj *obj) +{ + obj->active = 1; + list_add(&obj->list, &global_list); +} + +static void obj_list_del(struct obj *obj) +{ + list_del(&obj->list); + obj->active = 0; +} + +static void obj_destroy(struct obj *obj) +{ + BUG_ON(obj->active); + kfree(obj); +} + +struct obj *obj_list_peek(struct list_head *head) +{ + if (!list_empty(head)) { + struct obj *obj; + + obj = list_entry(head->next, struct obj, list); + atomic_inc(&obj->refcnt); + return obj; + } + return NULL; +} + +void obj_poke(void) +{ + struct obj *obj; + + spin_lock(&global_list_lock); + obj = obj_list_peek(&global_list); + spin_unlock(&global_list_lock); + + if (obj) { + obj->ops->poke(obj); + if (atomic_dec_and_test(&obj->refcnt)) + obj_destroy(obj); + } +} + +void obj_timeout(struct obj *obj) +{ + spin_lock(&global_list_lock); + obj_list_del(obj); + spin_unlock(&global_list_lock); + + if (atomic_dec_and_test(&obj->refcnt)) + obj_destroy(obj); +} + +(This is a simplification of the ARP queue management in the + generic neighbour discovery code of the networking stack. Olaf Kirch + found a bug wrt. memory barriers in kfree_skb() that exposed + the atomic_t memory barrier requirements quite clearly.) + +Given the above scheme, it must be the case that the obj->active +update done by the obj list deletion be visible to other processors +before the atomic counter decrement is performed. + +Otherwise, the counter could fall to zero, yet obj->active would still +be set, thus triggering the assertion in obj_destroy(). The error +sequence looks like this: + + cpu 0 cpu 1 + obj_poke() obj_timeout() + obj = obj_list_peek(); + ... gains ref to obj, refcnt=2 + obj_list_del(obj); + obj->active = 0 ... + ... visibility delayed ... + atomic_dec_and_test() + ... refcnt drops to 1 ... + atomic_dec_and_test() + ... 
refcnt drops to 0 ... + obj_destroy() + BUG() triggers since obj->active + still seen as one + obj->active update visibility occurs + +With the memory barrier semantics required of the atomic_t operations +which return values, the above sequence of memory visibility can never +happen. Specifically, in the above case the atomic_dec_and_test() +counter decrement would not become globally visible until the +obj->active update does. + +As a historical note, 32-bit Sparc used to only allow usage of +24 bits of its atomic_t type. This was because it used 8 bits +as a spinlock for SMP safety. Sparc32 lacked a "compare and swap" +type instruction. However, 32-bit Sparc has since been moved over +to a "hash table of spinlocks" scheme, that allows the full 32-bit +counter to be realized. Essentially, an array of spinlocks is +indexed into based upon the address of the atomic_t being operated +on, and that lock protects the atomic operation. Parisc uses the +same scheme. + +Another note is that the atomic_t operations returning values are +extremely slow on an old 386. + +We will now cover the atomic bitmask operations. You will find that +their SMP and memory barrier semantics are similar in shape and scope +to the atomic_t ops above. + +Native atomic bit operations are defined to operate on objects aligned +to the size of an "unsigned long" C data type, and are at least of that +size. The endianness of the bits within each "unsigned long" is the +native endianness of the cpu. + + void set_bit(unsigned long nr, volatile unsigned long *addr); + void clear_bit(unsigned long nr, volatile unsigned long *addr); + void change_bit(unsigned long nr, volatile unsigned long *addr); + +These routines set, clear, and change, respectively, the bit number +indicated by "nr" on the bit mask pointed to by "addr". + +They must execute atomically, yet there are no implicit memory barrier +semantics required of these interfaces.
+ + int test_and_set_bit(unsigned long nr, volatile unsigned long *addr); + int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); + int test_and_change_bit(unsigned long nr, volatile unsigned long *addr); + +Like the above, except that these routines return a boolean which +indicates whether the changed bit was set _BEFORE_ the atomic bit +operation. + +WARNING! It is incredibly important that the value be a boolean, +i.e. "0" or "1". Do not try to be fancy and save a few instructions by +declaring the above to return "long" and just returning something like +"old_val & mask" because that will not work. + +For one thing, this return value gets truncated to int in many code +paths using these interfaces, so on 64-bit if the bit is set in the +upper 32 bits then testers will never see that. + +One great example of where this problem crops up is the thread_info +flag operations. Routines such as test_and_set_ti_thread_flag() chop +the return value into an int. There are other places where things +like this occur as well. + +These routines, like the atomic_t counter operations returning values, +require explicit memory barrier semantics around their execution. All +memory operations before the atomic bit operation call must be made +visible globally before the atomic bit operation is made visible. +Likewise, the atomic bit operation must be visible globally before any +subsequent memory operation is made visible. For example: + + obj->dead = 1; + if (test_and_set_bit(0, &obj->flags)) + /* ... */; + obj->killed = 1; + +The implementation of test_and_set_bit() must guarantee that +"obj->dead = 1;" is visible to cpus before the atomic memory operation +done by test_and_set_bit() becomes visible. Likewise, the atomic +memory operation done by test_and_set_bit() must become visible before +"obj->killed = 1;" is visible.
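The boolean-return warning above can be made concrete with a userspace sketch. It assumes C11 <stdatomic.h>, treats bit "nr" as (1UL << nr) within each word (the real bit numbering is architecture-specific), and uses the hypothetical names sketch_test_and_set_bit and SKETCH_BITS_PER_LONG. The default sequentially consistent atomic_fetch_or() is only an approximation of the barrier semantics the text requires.

```c
#include <limits.h>
#include <stdatomic.h>

#define SKETCH_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical test-and-set: returns 1 if the bit was already set
 * before the operation, 0 otherwise. */
static inline int sketch_test_and_set_bit(unsigned long nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << (nr % SKETCH_BITS_PER_LONG);
	atomic_ulong *word = addr + nr / SKETCH_BITS_PER_LONG;
	unsigned long old = atomic_fetch_or(word, mask);

	/*
	 * Normalize to 0 or 1. Returning (old & mask) directly would be
	 * truncated to 0 for bits above bit 31 once converted to int on
	 * a 64-bit machine.
	 */
	return (old & mask) != 0;
}
```

Calling it twice on the highest bit of a zeroed word yields 0 and then 1 regardless of the width of "unsigned long", which is exactly what a caller storing the result in an int relies on.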
+ +Finally there is the basic operation: + + int test_bit(unsigned long nr, __const__ volatile unsigned long *addr); + +Which returns a boolean indicating if bit "nr" is set in the bitmask +pointed to by "addr". + +If explicit memory barriers are required around clear_bit() (which +does not return a value, and thus does not need to provide memory +barrier semantics), two interfaces are provided: + + void smp_mb__before_clear_bit(void); + void smp_mb__after_clear_bit(void); + +They are used as follows, and are akin to their atomic_t operation +brothers: + + /* All memory operations before this call will + * be globally visible before the clear_bit(). + */ + smp_mb__before_clear_bit(); + clear_bit( ... ); + + /* The clear_bit() will be visible before all + * subsequent memory operations. + */ + smp_mb__after_clear_bit(); + +Finally, there are non-atomic versions of the bitmask operations +provided. They are used in contexts where some other higher-level SMP +locking scheme is being used to protect the bitmask, and thus less +expensive non-atomic operations may be used in the implementation. +They have names similar to the above bitmask operation interfaces, +except that two underscores are prefixed to the interface name. + + void __set_bit(unsigned long nr, volatile unsigned long *addr); + void __clear_bit(unsigned long nr, volatile unsigned long *addr); + void __change_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr); + +These non-atomic variants also do not require any special memory +barrier semantics. + +The routines xchg() and cmpxchg() need the same exact memory barriers +as the atomic and bit operations returning values. + +Spinlocks and rwlocks have memory barrier expectations as well. 
+The rule to follow is simple: + +1) When acquiring a lock, the implementation must make it globally + visible before any subsequent memory operation. + +2) When releasing a lock, the implementation must make it such that + all previous memory operations are globally visible before the + lock release. + +Which finally brings us to _atomic_dec_and_lock(). There is an +architecture-neutral version implemented in lib/dec_and_lock.c, +but most platforms will wish to optimize this in assembler. + + int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock); + +Atomically decrement the given counter, and if it will drop to zero, +atomically acquire the given spinlock and perform the decrement +of the counter to zero. If it does not drop to zero, do nothing +with the spinlock. + +It is actually pretty simple to get the memory barrier correct. +Simply satisfy the spinlock grab requirements, which is to make +sure the spinlock operation is globally visible before any +subsequent memory operation. + +We can demonstrate this operation more clearly if we define +an abstract atomic operation: + + long cas(long *mem, long old, long new); + +"cas" stands for "compare and swap". It atomically: + +1) Compares "old" with the value currently at "mem". +2) If they are equal, "new" is written to "mem". +3) Regardless, the current value at "mem" is returned.
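For readers who want to experiment outside the kernel, the three steps above can be sketched with C11 atomics. This is an assumption-laden illustration, not how any architecture implements the operation: the _Atomic qualifier on "mem" and the parameter name new_val (used instead of "new" so the code also compiles as C++) are choices made here, not part of the abstract definition.

```c
#include <stdatomic.h>

/*
 * Hypothetical userspace model of the abstract cas() described above.
 * Returns the value that was at "mem" when the comparison was made.
 */
static long cas(_Atomic long *mem, long old, long new_val)
{
	long expected = old;

	/*
	 * If *mem == expected, new_val is stored and "expected" is left
	 * unchanged (it equals the value that was found). Otherwise
	 * nothing is stored and "expected" is overwritten with the value
	 * actually found at *mem.
	 */
	atomic_compare_exchange_strong(mem, &expected, new_val);

	return expected;
}
```

A caller detects success exactly as in the text: the update took effect if and only if the return value equals "old".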
+ +As an example usage, here is what an atomic counter update +might look like: + +void example_atomic_inc(long *counter) +{ + long old, new, ret; + + while (1) { + old = *counter; + new = old + 1; + + ret = cas(counter, old, new); + if (ret == old) + break; + } +} + +Let's use cas() in order to build a pseudo-C atomic_dec_and_lock(): + +int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) +{ + long old, new, ret; + int went_to_zero; + + went_to_zero = 0; + while (1) { + old = atomic_read(atomic); + new = old - 1; + if (new == 0) { + went_to_zero = 1; + spin_lock(lock); + } + ret = cas(atomic, old, new); + if (ret == old) + break; + if (went_to_zero) { + spin_unlock(lock); + went_to_zero = 0; + } + } + + return went_to_zero; +} + +Now, as far as memory barriers go, as long as spin_lock() +strictly orders all subsequent memory operations (including +the cas()) with respect to itself, things will be fine. + +Said another way, _atomic_dec_and_lock() must guarantee that +a counter dropping to zero is never made visible before the +spinlock being acquired. + +Note that this also means that for the case where the counter +is not dropping to zero, there are no memory ordering +requirements. diff --git a/Documentation/basic_profiling.txt b/Documentation/basic_profiling.txt new file mode 100644 index 000000000000..65e3dc2d4437 --- /dev/null +++ b/Documentation/basic_profiling.txt @@ -0,0 +1,52 @@ +These instructions are deliberately very basic. If you want something clever, +go read the real docs ;-) Please don't add more stuff, but feel free to +correct my mistakes ;-) (mbligh@aracnet.com) +Thanks to John Levon, Dave Hansen, et al. for help writing this. + + is the thing you're trying to measure. +Make sure you have the correct System.map / vmlinux referenced! + +It is probably easiest to use "make install" for linux and hack +/sbin/installkernel to copy vmlinux to /boot, in addition to vmlinuz, +config, System.map, which are usually installed by default.
+ +Readprofile +----------- +A recent readprofile command is needed for 2.6, such as found in util-linux +2.12a, which can be downloaded from: + +http://www.kernel.org/pub/linux/utils/util-linux/ + +Most distributions will ship it already. + +Add "profile=2" to the kernel command line. + +clear readprofile -r + +dump output readprofile -m /boot/System.map > captured_profile + +Oprofile +-------- +Get the source (I use 0.8) from http://oprofile.sourceforge.net/ +and add "idle=poll" to the kernel command line +Configure with CONFIG_PROFILING=y and CONFIG_OPROFILE=y & reboot on new kernel +./configure --with-kernel-support +make install + +For superior results, be sure to enable the local APIC. If opreport sees +a 0Hz CPU, APIC was not on. Be aware that idle=poll may mean a performance +penalty. + +One time setup: + opcontrol --setup --vmlinux=/boot/vmlinux + +clear opcontrol --reset +start opcontrol --start + +stop opcontrol --stop +dump output opreport > output_file + +To only report on the kernel, run opreport /boot/vmlinux > output_file + +A reset is needed to clear old statistics, which survive a reboot. + diff --git a/Documentation/binfmt_misc.txt b/Documentation/binfmt_misc.txt new file mode 100644 index 000000000000..d097f09ee15a --- /dev/null +++ b/Documentation/binfmt_misc.txt @@ -0,0 +1,116 @@ + Kernel Support for miscellaneous (your favourite) Binary Formats v1.1 + ===================================================================== + +This Kernel feature allows you to invoke almost (for restrictions see below) +every program by simply typing its name in the shell. +This includes for example compiled Java(TM), Python or Emacs programs. + +To achieve this you must tell binfmt_misc which interpreter has to be invoked +with which binary. Binfmt_misc recognises the binary-type by matching some bytes +at the beginning of the file with a magic byte sequence (masking out specified +bits) you have supplied. 
Binfmt_misc can also recognise a filename extension +such as '.com' or '.exe'. + +First you must mount binfmt_misc: + mount binfmt_misc -t binfmt_misc /proc/sys/fs/binfmt_misc + +To actually register a new binary type, you have to set up a string looking like +:name:type:offset:magic:mask:interpreter:flags (where you can choose the +separator ':' to suit your needs) and echo it to /proc/sys/fs/binfmt_misc/register. +Here is what the fields mean: + - 'name' is an identifier string. A new /proc file will be created with this + name below /proc/sys/fs/binfmt_misc + - 'type' is the type of recognition. Give 'M' for magic and 'E' for extension. + - 'offset' is the offset of the magic/mask in the file, counted in bytes. This + defaults to 0 if you omit it (i.e. you write ':name:type::magic...') + - 'magic' is the byte sequence binfmt_misc matches against. The magic string + may contain hex-encoded characters like \x0a or \xA4. In a shell environment + you will have to write \\x0a to prevent the shell from eating your \. + If you chose filename extension matching, this is the extension to be + recognised (without the '.', the \x0a specials are not allowed). Extension + matching is case sensitive! + - 'mask' is an (optional, defaults to all 0xff) mask. You can mask out some + bits from matching by supplying a string like magic and as long as magic. + The mask is anded with the byte sequence of the file. + - 'interpreter' is the program that should be invoked with the binary as first + argument (specify the full path) + - 'flags' is an optional field that controls several aspects of the invocation + of the interpreter. It is a string of capital letters, each of which controls + a certain aspect. The following flags are supported - + 'P' - preserve-argv[0]. Legacy behavior of binfmt_misc is to overwrite the + original argv[0] with the full path to the binary. When this flag is + included, binfmt_misc will add an argument to the argument vector for + this purpose, thus preserving the original argv[0].
+ 'O' - open-binary. Legacy behavior of binfmt_misc is to pass the full path + of the binary to the interpreter as an argument. When this flag is + included, binfmt_misc will open the file for reading and pass its + descriptor as an argument, instead of the full path, thus allowing + the interpreter to execute non-readable binaries. This feature should + be used with care - the interpreter has to be trusted not to emit + the contents of the non-readable binary. + 'C' - credentials. Currently, the behavior of binfmt_misc is to calculate + the credentials and security token of the new process according to + the interpreter. When this flag is included, these attributes are + calculated according to the binary. It also implies the 'O' flag. + This feature should be used with care as the interpreter + will run with root permissions when a setuid binary owned by root + is run with binfmt_misc. + + +There are some restrictions: + - the whole register string may not exceed 255 characters + - the magic must reside in the first 128 bytes of the file, i.e. + offset+size(magic) has to be less than 128 + - the interpreter string may not exceed 127 characters + +To use binfmt_misc you have to mount it first. You can mount it with +"mount -t binfmt_misc none /proc/sys/fs/binfmt_misc" command, or you can add +a line "none /proc/sys/fs/binfmt_misc binfmt_misc defaults 0 0" to your +/etc/fstab so it auto mounts on boot. + +You may want to add the binary formats in one of your /etc/rc scripts during +boot-up. Read the manual of your init program to figure out how to do this +right. + +Think about the order of adding entries! Later added entries are matched first! 
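The register string layout and two of the length restrictions listed above can be sketched as a small C helper. Everything here is hypothetical illustration: build_register_string(), REGISTER_MAX and INTERP_MAX are names invented for this sketch, and the helper does not check the "offset+size(magic) < 128" rule, since that would require decoding the \x escapes in the magic string.

```c
#include <stdio.h>
#include <string.h>

/* Limits quoted from the restrictions listed above. */
#define REGISTER_MAX 255	/* whole register string */
#define INTERP_MAX   127	/* interpreter path */

/*
 * Hypothetical helper: format a binfmt_misc register string into buf
 * using ':' as the separator. Pass "" for fields that are omitted.
 * Returns 0 on success, -1 if a documented length limit is exceeded
 * or the buffer is too small.
 */
static int build_register_string(char *buf, size_t buflen,
                                 const char *name, char type,
                                 const char *offset, const char *magic,
                                 const char *mask, const char *interp,
                                 const char *flags)
{
	int n;

	if (strlen(interp) > INTERP_MAX)
		return -1;

	n = snprintf(buf, buflen, ":%s:%c:%s:%s:%s:%s:%s",
		     name, type, offset, magic, mask, interp, flags);
	if (n < 0 || (size_t)n >= buflen || n > REGISTER_MAX)
		return -1;

	return 0;
}
```

With name "DOSWin", type 'M', magic "MZ", interpreter "/usr/local/bin/wine" and the remaining fields empty, the helper produces ":DOSWin:M::MZ::/usr/local/bin/wine:", which is the string the wine example in this document echoes to the register file.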
+ + +A few examples (assumed you are in /proc/sys/fs/binfmt_misc): + +- enable support for em86 (like binfmt_em86, for Alpha AXP only): + echo ':i386:M::\x7fELF\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x03:\xff\xff\xff\xff\xff\xfe\xfe\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfb\xff\xff:/bin/em86:' > register + echo ':i486:M::\x7fELF\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x06:\xff\xff\xff\xff\xff\xfe\xfe\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfb\xff\xff:/bin/em86:' > register + +- enable support for packed DOS applications (pre-configured dosemu hdimages): + echo ':DEXE:M::\x0eDEX::/usr/bin/dosexec:' > register + +- enable support for Windows executables using wine: + echo ':DOSWin:M::MZ::/usr/local/bin/wine:' > register + +For java support see Documentation/java.txt + + +You can enable/disable binfmt_misc or one binary type by echoing 0 (to disable) +or 1 (to enable) to /proc/sys/fs/binfmt_misc/status or /proc/.../the_name. +Catting the file tells you the current status of binfmt_misc/the entry. + +You can remove one entry or all entries by echoing -1 to /proc/.../the_name +or /proc/sys/fs/binfmt_misc/status. + + +HINTS: +====== + +If you want to pass special arguments to your interpreter, you can +write a wrapper script for it. See Documentation/java.txt for an +example. + +Your interpreter should NOT look in the PATH for the filename; the kernel +passes it the full filename (or the file descriptor) to use. Using $PATH can +cause unexpected behaviour and can be a security hazard. + + +There is a web page about binfmt_misc at +http://www.tat.physik.uni-tuebingen.de/~rguenth/linux/binfmt_misc.html + +Richard Günther diff --git a/Documentation/block/as-iosched.txt b/Documentation/block/as-iosched.txt new file mode 100644 index 000000000000..6f47332c883d --- /dev/null +++ b/Documentation/block/as-iosched.txt @@ -0,0 +1,165 @@ +Anticipatory IO scheduler +------------------------- +Nick Piggin 13 Sep 2003 + +Attention! 
Database servers, especially those using "TCQ" disks, should +investigate performance with the 'deadline' IO scheduler. Any system with high +disk performance requirements should do so, in fact. + +If you see unusual performance characteristics of your disk systems, or you +see big performance regressions versus the deadline scheduler, please email +me. Database users: don't bother unless you're willing to test a lot of patches +from me ;) it's a known issue. + +Also, users with hardware RAID controllers, doing striping, may find +highly variable performance results when using the as-iosched. The +as-iosched anticipatory implementation is based on the notion that a disk +device has only one physical seeking head. A striped RAID controller +actually has a head for each physical device in the logical RAID device. + +However, setting antic_expire to zero (see tunable parameters below) produces +very similar behavior to the deadline IO scheduler. + + +Selecting IO schedulers +----------------------- +To choose IO schedulers at boot time, use the argument 'elevator=deadline'. +'noop' and 'as' (the default) are also available. At present, IO schedulers +can only be assigned globally at boot time. + + +Anticipatory IO scheduler Policies +---------------------------------- +The as-iosched implementation implements several layers of policies +to determine when an IO request is dispatched to the disk controller. +Here are the policies outlined, in order of application. + +1. one-way Elevator algorithm. + +The elevator algorithm is similar to that used in deadline scheduler, with +the addition that it allows limited backward movement of the elevator +(i.e. seeks backwards). A seek backwards can occur when choosing between +two IO requests where one is behind the elevator's current position, and +the other is in front of the elevator's position.
If the seek distance to +the request in back of the elevator is less than half the seek distance to +the request in front of the elevator, then the request in back can be chosen. +Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors. +This favors forward movement of the elevator, while allowing opportunistic +"short" backward seeks. + +2. FIFO expiration times for reads and for writes. + +This is again very similar to the deadline IO scheduler. The expiration +times for requests on these lists are tunable using the parameters read_expire +and write_expire discussed below. When a read or a write expires in this way, +the IO scheduler will interrupt its current elevator sweep or read anticipation +to service the expired request. + +3. Read and write request batching + +A batch is a collection of read requests or a collection of write +requests. The as scheduler alternates dispatching read and write batches +to the driver. In the case of a read batch, the scheduler submits read +requests to the driver as long as there are read requests to submit, and +the read batch time limit has not been exceeded (read_batch_expire). +The read batch time limit begins counting down only when there are +competing write requests pending. + +In the case of a write batch, the scheduler submits write requests to +the driver as long as there are write requests available, and the +write batch time limit has not been exceeded (write_batch_expire). +However, the length of write batches will be gradually shortened +when read batches frequently exceed their time limit. + +When changing between batch types, the scheduler waits for all requests +from the previous batch to complete before scheduling requests for the +next batch. + +The read and write fifo expiration times described in policy 2 above +are checked only when scheduling IO of a batch for the corresponding +(read/write) type. So for example, the read FIFO timeout values are +tested only during read batches.
Likewise, the write FIFO timeout +values are tested only during write batches. For this reason, +it is generally not recommended for the read batch time +to be longer than the write expiration time, nor for the write batch +time to exceed the read expiration time (see tunable parameters below). + +When the IO scheduler changes from a read to a write batch, +it begins the elevator from the request that is on the head of the +write expiration FIFO. Likewise, when changing from a write batch to +a read batch, the scheduler begins the elevator from the first entry +on the read expiration FIFO. + +4. Read anticipation. + +Read anticipation occurs only when scheduling a read batch. +This implementation of read anticipation allows only one read request +to be dispatched to the disk controller at a time. In +contrast, many write requests may be dispatched to the disk controller +at a time during a write batch. It is this characteristic that can make +the anticipatory scheduler perform anomalously with controllers supporting +TCQ, or with hardware striped RAID devices. Setting the antic_expire +queue parameter (see below) to zero disables this behavior, and the anticipatory +scheduler behaves essentially like the deadline scheduler. + +When read anticipation is enabled (antic_expire is not zero), reads +are dispatched to the disk controller one at a time. +At the end of each read request, the IO scheduler examines its next +candidate read request from its sorted read list. If that next request +is from the same process as the request that just completed, +or if the next request in the queue is "very close" to the +just completed request, it is dispatched immediately. Otherwise, +statistics (average think time, average seek distance) on the process +that submitted the just completed request are examined.
If it seems +likely that that process will submit another request soon, and that +request is likely to be near the just completed request, then the IO +scheduler will stop dispatching more read requests for up to antic_expire +milliseconds, hoping that process will submit a new request near the one +that just completed. If such a request is made, then it is dispatched +immediately. If the antic_expire wait time expires, then the IO scheduler +will dispatch the next read request from the sorted read queue. + +To decide whether an anticipatory wait is worthwhile, the scheduler +maintains statistics for each process that can be used to compute +mean "think time" (the time between read requests), and mean seek +distance for that process. One observation is that these statistics +are associated with each process, but those statistics are not associated +with a specific IO device. So for example, if a process is doing IO +on several file systems on separate devices, the statistics will be +a combination of IO behavior from all those devices. + + +Tuning the anticipatory IO scheduler +------------------------------------ +When using 'as', the anticipatory IO scheduler, there are 5 parameters under +/sys/block/*/queue/iosched/. All are in units of milliseconds. + +The parameters are: +* read_expire + Controls how long until a read request becomes "expired". It also controls the + interval between which expired requests are served, so set to 50, a request + might take anywhere < 100ms to be serviced _if_ it is the next on the + expired list. Obviously request expiration strategies won't make the disk + go faster. The result basically equates to the timeslice a single reader + gets in the presence of other IO. 100/((seek time / read_expire) + 1) is + very roughly the % streaming read efficiency your disk should get with + multiple readers. + +* read_batch_expire + Controls how much time a batch of reads is given before pending writes are + served.
A higher value is more efficient. This might be set below read_expire + if writes are to be given higher priority than reads, but reads are to be + as efficient as possible when there are no writes. Generally though, it + should be some multiple of read_expire. + +* write_expire, and +* write_batch_expire are equivalent to the above, for writes. + +* antic_expire + Controls the maximum amount of time we can anticipate a good read (one + with a short seek distance from the most recently completed request) before + giving up. Many other factors may cause anticipation to be stopped early, + or some processes will not be "anticipated" at all. Should be a bit higher + for big seek time devices though not a linear correspondence - most + processes have only a few ms thinktime. + diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt new file mode 100644 index 000000000000..6dd274d7e1cf --- /dev/null +++ b/Documentation/block/biodoc.txt @@ -0,0 +1,1213 @@ + Notes on the Generic Block Layer Rewrite in Linux 2.5 + ===================================================== + +Notes Written on Jan 15, 2002: + Jens Axboe + Suparna Bhattacharya + +Last Updated May 2, 2002 +September 2003: Updated I/O Scheduler portions + Nick Piggin + +Introduction: + +These are some notes describing some aspects of the 2.5 block layer in the +context of the bio rewrite. The idea is to bring out some of the key +changes and a glimpse of the rationale behind those changes. + +Please mail corrections & suggestions to suparna@in.ibm.com. + +Credits: +--------- + +2.5 bio rewrite: + Jens Axboe + +Many aspects of the generic block layer redesign were driven by and evolved +over discussions, prior patches and the collective experience of several +people. See sections 8 and 9 for a list of some related references. 
+ +The following people helped with review comments and inputs for this +document: + Christoph Hellwig + Arjan van de Ven + Randy Dunlap + Andre Hedrick + +The following people helped with fixes/contributions to the bio patches +while it was still work-in-progress: + David S. Miller + + +Description of Contents: +------------------------ + +1. Scope for tuning of logic to various needs + 1.1 Tuning based on device or low level driver capabilities + - Per-queue parameters + - Highmem I/O support + - I/O scheduler modularization + 1.2 Tuning based on high level requirements/capabilities + 1.2.1 I/O Barriers + 1.2.2 Request Priority/Latency + 1.3 Direct access/bypass to lower layers for diagnostics and special + device operations + 1.3.1 Pre-built commands +2. New flexible and generic but minimalist i/o structure or descriptor + (instead of using buffer heads at the i/o layer) + 2.1 Requirements/Goals addressed + 2.2 The bio struct in detail (multi-page io unit) + 2.3 Changes in the request structure +3. Using bios + 3.1 Setup/teardown (allocation, splitting) + 3.2 Generic bio helper routines + 3.2.1 Traversing segments and completion units in a request + 3.2.2 Setting up DMA scatterlists + 3.2.3 I/O completion + 3.2.4 Implications for drivers that do not interpret bios (don't handle + multiple segments) + 3.2.5 Request command tagging + 3.3 I/O submission +4. The I/O scheduler +5. Scalability related changes + 5.1 Granular locking: Removal of io_request_lock + 5.2 Prepare for transition to 64 bit sector_t +6. Other Changes/Implications + 6.1 Partition re-mapping handled by the generic block layer +7. A few tips on migration of older drivers +8. A list of prior/related/impacted patches/ideas +9. Other References/Discussion Threads + +--------------------------------------------------------------------------- + +Bio Notes +-------- + +Let us discuss the changes in the context of how some overall goals for the +block layer are addressed. + +1. 
Scope for tuning the generic logic to satisfy various requirements + +The block layer design supports adaptable abstractions to handle common +processing with the ability to tune the logic to an appropriate extent +depending on the nature of the device and the requirements of the caller. +One of the objectives of the rewrite was to increase the degree of tunability +and to enable higher level code to utilize underlying device/driver +capabilities to the maximum extent for better i/o performance. This is +important especially in the light of ever improving hardware capabilities +and application/middleware software designed to take advantage of these +capabilities. + +1.1 Tuning based on low level device / driver capabilities + +Sophisticated devices with large built-in caches, intelligent i/o scheduling +optimizations, high memory DMA support, etc may find some of the +generic processing an overhead, while for less capable devices the +generic functionality is essential for performance or correctness reasons. +Knowledge of some of the capabilities or parameters of the device should be +used at the generic block layer to take the right decisions on +behalf of the driver. + +How is this achieved ? + +Tuning at a per-queue level: + +i. Per-queue limits/values exported to the generic layer by the driver + +Various parameters that the generic i/o scheduler logic uses are set at +a per-queue level (e.g maximum request size, maximum number of segments in +a scatter-gather list, hardsect size) + +Some parameters that were earlier available as global arrays indexed by +major/minor are now directly associated with the queue. Some of these may +move into the block device structure in the future. Some characteristics +have been incorporated into a queue flags field rather than separate fields +in themselves. 
There are blk_queue_xxx functions to set the parameters, +rather than update the fields directly + +Some new queue property settings: + + blk_queue_bounce_limit(q, u64 dma_address) + Enable I/O to highmem pages, dma_address being the + limit. No highmem default. + + blk_queue_max_sectors(q, max_sectors) + Maximum size request you can handle in units of 512 byte + sectors. 255 default. + + blk_queue_max_phys_segments(q, max_segments) + Maximum physical segments you can handle in a request. 128 + default (driver limit). (See 3.2.2) + + blk_queue_max_hw_segments(q, max_segments) + Maximum dma segments the hardware can handle in a request. 128 + default (host adapter limit, after dma remapping). + (See 3.2.2) + + blk_queue_max_segment_size(q, max_seg_size) + Maximum size of a clustered segment, 64kB default. + + blk_queue_hardsect_size(q, hardsect_size) + Lowest possible sector size that the hardware can operate + on, 512 bytes default. + +New queue flags: + + QUEUE_FLAG_CLUSTER (see 3.2.2) + QUEUE_FLAG_QUEUED (see 3.2.4) + + +ii. High-mem i/o capabilities are now considered the default + +The generic bounce buffer logic, present in 2.4, where the block layer would +by default copyin/out i/o requests on high-memory buffers to low-memory buffers +assuming that the driver wouldn't be able to handle it directly, has been +changed in 2.5. The bounce logic is now applied only for memory ranges +for which the device cannot handle i/o. A driver can specify this by +setting the queue bounce limit for the request queue for the device +(blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out +where a device is capable of handling high memory i/o. 
+ +In order to enable high-memory i/o where the device is capable of supporting +it, the pci dma mapping routines and associated data structures have now been +modified to accomplish a direct page -> bus translation, without requiring +a virtual address mapping (unlike the earlier scheme of virtual address +-> bus translation). So this works uniformly for high-memory pages (which +do not have a corresponding kernel virtual address space mapping) and +low-memory pages. + +Note: Please refer to DMA-mapping.txt for a discussion on PCI high mem DMA +aspects and mapping of scatter gather lists, and support for 64 bit PCI. + +Special handling is required only for cases where i/o needs to happen on +pages at physical memory addresses beyond what the device can support. In these +cases, a bounce bio representing a buffer from the supported memory range +is used for performing the i/o with copyin/copyout as needed depending on +the type of the operation. For example, in case of a read operation, the +data read has to be copied to the original buffer on i/o completion, so a +callback routine is set up to do this, while for write, the data is copied +from the original buffer to the bounce buffer prior to issuing the +operation. Since an original buffer may be in a high memory area that's not +mapped in kernel virtual addr, a kmap operation may be required for +performing the copy, and special care may be needed in the completion path +as it may not be in irq context. Special care is also required (by way of +GFP flags) when allocating bounce buffers, to avoid certain highmem +deadlock possibilities. + +It is also possible that a bounce buffer may be allocated from high-memory +area that's not mapped in kernel virtual addr, but within the range that the +device can use directly; so the bounce page may need to be kmapped during +copy operations.
[Note: This does not hold in the current implementation, +though] + +There are some situations when pages from high memory may need to +be kmapped, even if bounce buffers are not necessary. For example a device +may need to abort DMA operations and revert to PIO for the transfer, in +which case a virtual mapping of the page is required. For SCSI it is also +done in some scenarios where the low level driver cannot be trusted to +handle a single sg entry correctly. The driver is expected to perform the +kmaps as needed on such occasions using the __bio_kmap_atomic and bio_kmap_irq +routines as appropriate. A driver could also use the blk_queue_bounce() +routine on its own to bounce highmem i/o to low memory for specific requests +if so desired. + +iii. The i/o scheduler algorithm itself can be replaced/set as appropriate + +As in 2.4, it is possible to plugin a brand new i/o scheduler for a particular +queue or pick from (copy) existing generic schedulers and replace/override +certain portions of it. The 2.5 rewrite provides improved modularization +of the i/o scheduler. There are more pluggable callbacks, e.g for init, +add request, extract request, which makes it possible to abstract specific +i/o scheduling algorithm aspects and details outside of the generic loop. +It also makes it possible to completely hide the implementation details of +the i/o scheduler from block drivers. + +I/O scheduler wrappers are to be used instead of accessing the queue directly. +See section 4. The I/O scheduler for details. + +1.2 Tuning Based on High level code capabilities + +i. Application capabilities for raw i/o + +This comes from some of the high-performance database/middleware +requirements where an application prefers to make its own i/o scheduling +decisions based on an understanding of the access patterns and i/o +characteristics + +ii. 
High performance filesystems or other higher level kernel code's +capabilities + +Kernel components like filesystems could also take their own i/o scheduling +decisions for optimizing performance. Journalling filesystems may need +some control over i/o ordering. + +What kind of support exists at the generic block layer for this ? + +The flags and rw fields in the bio structure can be used for some tuning +from above e.g indicating that an i/o is just a readahead request, or for +marking barrier requests (discussed next), or priority settings (currently +unused). As far as user applications are concerned they would need an +additional mechanism either via open flags or ioctls, or some other upper +level mechanism to communicate such settings to block. + +1.2.1 I/O Barriers + +There is a way to enforce strict ordering for i/os through barriers. +All requests before a barrier point must be serviced before the barrier +request and any other requests arriving after the barrier will not be +serviced until after the barrier has completed. This is useful for higher +level control on write ordering, e.g flushing a log of committed updates +to disk before the corresponding updates themselves. + +A flag in the bio structure, BIO_BARRIER is used to identify a barrier i/o. +The generic i/o scheduler would make sure that it places the barrier request and +all other requests coming after it after all the previous requests in the +queue. Barriers may be implemented in different ways depending on the +driver. A SCSI driver for example could make use of ordered tags to +preserve the necessary ordering with a lower impact on throughput. For IDE +this might be two sync cache flushes: a pre and post flush when encountering +a barrier write. + +There is a provision for queues to indicate what kind of barriers they +can provide. This is as yet unmerged; details will be added here once it +is in the kernel.
+ +1.2.2 Request Priority/Latency + +Todo/Under discussion: +Arjan's proposed request priority scheme allows higher levels some broad + control (high/med/low) over the priority of an i/o request vs other pending + requests in the queue. For example it allows reads for bringing in an + executable page on demand to be given a higher priority over pending write + requests which haven't aged too much on the queue. Potentially this priority + could even be exposed to applications in some manner, providing higher level + tunability. Time based aging avoids starvation of lower priority + requests. Some bits in the bi_rw flags field in the bio structure are + intended to be used for this priority information. + + +1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode) + (e.g Diagnostics, Systems Management) + +There are situations where high-level code needs to have direct access to +the low level device capabilities or requires the ability to issue commands +to the device bypassing some of the intermediate i/o layers. +These could, for example, be special control commands issued through ioctl +interfaces, or could be raw read/write commands that stress the drive's +capabilities for certain kinds of fitness tests. Having direct interfaces at +multiple levels without having to pass through upper layers makes +it possible to perform bottom up validation of the i/o path, layer by +layer, starting from the media. + +The normal i/o submission interfaces, e.g submit_bio, could be bypassed +for specially crafted requests which such ioctl or diagnostics +interfaces would typically use, and the elevator add_request routine +can instead be used to directly insert such requests in the queue or preferably +the blk_do_rq routine can be used to place the request on the queue and +wait for completion. Alternatively, sometimes the caller might just +invoke a lower level driver specific interface with the request as a +parameter. 
+ +If the request is a means for passing on special information associated with +the command, then such information is associated with the request->special +field (rather than misuse the request->buffer field which is meant for the +request data buffer's virtual mapping). + +For passing request data, the caller must build up a bio descriptor +representing the concerned memory buffer if the underlying driver interprets +bio segments or uses the block layer end*request* functions for i/o +completion. Alternatively one could directly use the request->buffer field to +specify the virtual address of the buffer, if the driver expects buffer +addresses passed in this way and ignores bio entries for the request type +involved. In the latter case, the driver would modify and manage the +request->buffer, request->sector and request->nr_sectors or +request->current_nr_sectors fields itself rather than using the block layer +end_request or end_that_request_first completion interfaces. +(See 2.3 or Documentation/block/request.txt for a brief explanation of +the request structure fields) + +[TBD: end_that_request_last should be usable even in this case; +Perhaps an end_that_direct_request_first routine could be implemented to make +handling direct requests easier for such drivers; Also for drivers that +expect bios, a helper function could be provided for setting up a bio +corresponding to a data buffer] + + + + +1.3.1 Pre-built Commands + +A request can be created with a pre-built custom command to be sent directly +to the device. The cmd block in the request structure has room for filling +in the command bytes. 
(i.e. rq->cmd is now 16 bytes in size, and meant for +command pre-building, and the type of the request is now indicated +through rq->flags instead of via rq->cmd) + +The request structure flags can be set up to indicate the type of request +in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC: +packet command issued via blk_do_rq, REQ_SPECIAL: special request). + +It can help to pre-build device commands for requests in advance. +Drivers can now specify a request prepare function (q->prep_rq_fn) that the +block layer would invoke to pre-build device commands for a given request, +or perform other preparatory processing for the request. This routine is +called by elv_next_request(), i.e. typically just before servicing a request. +(The prepare function would not be called for requests that have REQ_DONTPREP +enabled) + +Aside: + Pre-building could possibly even be done early, i.e. before placing the + request on the queue, rather than construct the command on the fly in the + driver while servicing the request queue when it may affect latencies in + interrupt context or responsiveness in general. One way to add early + pre-building would be to do it whenever we fail to merge on a request. + Now REQ_NOMERGE is set in the request flags to skip this one in the future, + which means that it will not change before we feed it to the device. So + the pre-builder hook can be invoked there. + + +2. Flexible and generic but minimalist i/o structure/descriptor. + +2.1 Reason for a new structure and requirements addressed + +Prior to 2.5, buffer heads were used as the unit of i/o at the generic block +layer, and the low level request structure was associated with a chain of +buffer heads for a contiguous i/o request.
This led to certain inefficiencies +when it came to large i/o requests and readv/writev style operations, as it +forced such requests to be broken up into small chunks before being passed +on to the generic block layer, only to be merged by the i/o scheduler +when the underlying device was capable of handling the i/o in one shot. +Also, using the buffer head as an i/o structure for i/os that didn't originate +from the buffer cache unnecessarily added to the weight of the descriptors +which were generated for each such chunk. + +The following were some of the goals and expectations considered in the +redesign of the block i/o data structure in 2.5. + +i. Should be appropriate as a descriptor for both raw and buffered i/o - + avoid cache related fields which are irrelevant in the direct/page i/o path, + or filesystem block size alignment restrictions which may not be relevant + for raw i/o. +ii. Ability to represent high-memory buffers (which do not have a virtual + address mapping in kernel address space). +iii. Ability to represent large i/os w/o unnecessarily breaking them up (i.e. + greater than PAGE_SIZE chunks in one shot) +iv. At the same time, ability to retain independent identity of i/os from + different sources or i/o units requiring individual completion (e.g. for + latency reasons) +v. Ability to represent an i/o involving multiple physical memory segments + (including non-page aligned page fragments, as specified via readv/writev) + without unnecessarily breaking it up, if the underlying device is capable of + handling it. +vi. Preferably should be based on a memory descriptor structure that can be + passed around different types of subsystems or layers, maybe even + networking, without duplication or extra copies of data/descriptor fields + themselves in the process +vii. Ability to handle the possibility of splits/merges as the structure passes + through layered drivers (lvm, md, evms), with minimal overhead.
+ +The solution was to define a new structure (bio) for the block layer, +instead of using the buffer head structure (bh) directly, the idea being +avoidance of some associated baggage and limitations. The bio structure +is uniformly used for all i/o at the block layer; it forms a part of the +bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are +mapped to bio structures. + +2.2 The bio struct + +The bio structure uses a vector representation pointing to an array of tuples +of <page, offset, len> to describe the i/o buffer, and has various other +fields describing i/o parameters and state that needs to be maintained for +performing the i/o. + +Notice that this representation means that a bio has no virtual address +mapping at all (unlike buffer heads). + +struct bio_vec { + struct page *bv_page; + unsigned short bv_len; + unsigned short bv_offset; +}; + +/* + * main unit of I/O for the block layer and lower layers (ie drivers) + */ +struct bio { + sector_t bi_sector; + struct bio *bi_next; /* request queue link */ + struct block_device *bi_bdev; /* target device */ + unsigned long bi_flags; /* status, command, etc */ + unsigned long bi_rw; /* low bits: r/w, high: priority */ + + unsigned int bi_vcnt; /* how many bio_vec's */ + unsigned int bi_idx; /* current index into bio_vec array */ + + unsigned int bi_size; /* total size in bytes */ + unsigned short bi_phys_segments; /* segments after physaddr coalesce */ + unsigned short bi_hw_segments; /* segments after DMA remapping */ + unsigned int bi_max; /* max bio_vecs we can hold + used as index into pool */ + struct bio_vec *bi_io_vec; /* the actual vec list */ + bio_end_io_t *bi_end_io; /* bi_end_io (bio) */ + atomic_t bi_cnt; /* pin count: free when it hits zero */ + void *bi_private; + bio_destructor_t *bi_destructor; /* bi_destructor (bio) */ +}; + +With this multipage bio design: + +- Large i/os can be sent down in one go using a bio_vec list consisting + of an array of fragments (similar to the way fragments
+ are represented in the zero-copy network code) +- Splitting of an i/o request across multiple devices (as in the case of + lvm or raid) is achieved by cloning the bio (where the clone points to + the same bi_io_vec array, but with the index and size accordingly modified) +- A linked list of bios is used as before for unrelated merges (*) - this + avoids reallocs and makes independent completions easier to handle. +- Code that traverses the req list needs to make a distinction between + segments of a request (bio_for_each_segment) and the distinct completion + units/bios (rq_for_each_bio). +- Drivers which can't process a large bio in one shot can use the bi_idx + field to keep track of the next bio_vec entry to process. + (e.g. a 1MB bio_vec needs to be handled in max 128kB chunks for IDE) + [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying + the bv_offset and bv_len fields] + +(*) unrelated merges -- a request ends up containing two or more bios that + didn't originate from the same place. + +The bi_end_io() i/o callback gets called on i/o completion of the entire bio. + +At a lower level, drivers build a scatter gather list from the merged bios. +The scatter gather list is in the form of an array of <page, offset, len> +entries with their corresponding dma address mappings filled in at the +appropriate time. As an optimization, contiguous physical pages can be +covered by a single entry where <page> refers to the first page and <len> +covers the range of pages (up to 16 contiguous pages could be covered this +way). There is a helper routine (blk_rq_map_sg) which drivers can use to build +the sg list. + +Note: Right now the only user of bios with more than one page is ll_rw_kio, +which in turn means that only raw I/O uses it (direct i/o may not work +right now). The intent however is to enable clustering of pages etc to +become possible. The pagebuf abstraction layer from SGI also uses multi-page +bios, but that is currently not included in the stock development kernels.
+The same is true of Andrew Morton's work-in-progress multipage bio writeout +and readahead patches. + +2.3 Changes in the Request Structure + +The request structure is the structure that gets passed down to low level +drivers. The block layer make_request function builds up a request structure, +places it on the queue and invokes the drivers request_fn. The driver makes +use of block layer helper routine elv_next_request to pull the next request +off the queue. Control or diagnostic functions might bypass block and directly +invoke underlying driver entry points passing in a specially constructed +request structure. + +Only some relevant fields (mainly those which changed or may be referred +to in some of the discussion here) are listed below, not necessarily in +the order in which they occur in the structure (see include/linux/blkdev.h) +Refer to Documentation/block/request.txt for details about all the request +structure fields and a quick reference about the layers which are +supposed to use or modify those fields. + +struct request { + struct list_head queuelist; /* Not meant to be directly accessed by + the driver. + Used by q->elv_next_request_fn + rq->queue is gone + */ + . + . + unsigned char cmd[16]; /* prebuilt command data block */ + unsigned long flags; /* also includes earlier rq->cmd settings */ + . + . + sector_t sector; /* this field is now of type sector_t instead of int + preparation for 64 bit sectors */ + . + . + + /* Number of scatter-gather DMA addr+len pairs after + * physical address coalescing is performed. + */ + unsigned short nr_phys_segments; + + /* Number of scatter-gather addr+len pairs after + * physical and DMA remapping hardware coalescing is performed. + * This is the number of scatter-gather entries the driver + * will actually have to deal with after DMA mapping is done. + */ + unsigned short nr_hw_segments; + + /* Various sector counts */ + unsigned long nr_sectors; /* no. 
of sectors left: driver modifiable */ + unsigned long hard_nr_sectors; /* block internal copy of above */ + unsigned int current_nr_sectors; /* no. of sectors left in the + current segment:driver modifiable */ + unsigned long hard_cur_sectors; /* block internal copy of the above */ + . + . + int tag; /* command tag associated with request */ + void *special; /* same as before */ + char *buffer; /* valid only for low memory buffers upto + current_nr_sectors */ + . + . + struct bio *bio, *biotail; /* bio list instead of bh */ + struct request_list *rl; +} + +See the rq_flag_bits definitions for an explanation of the various flags +available. Some bits are used by the block layer or i/o scheduler. + +The behaviour of the various sector counts are almost the same as before, +except that since we have multi-segment bios, current_nr_sectors refers +to the numbers of sectors in the current segment being processed which could +be one of the many segments in the current bio (i.e i/o completion unit). +The nr_sectors value refers to the total number of sectors in the whole +request that remain to be transferred (no change). The purpose of the +hard_xxx values is for block to remember these counts every time it hands +over the request to the driver. These values are updated by block on +end_that_request_first, i.e. every time the driver completes a part of the +transfer and invokes block end*request helpers to mark this. The +driver should not modify these values. The block layer sets up the +nr_sectors and current_nr_sectors fields (based on the corresponding +hard_xxx values and the number of bytes transferred) and updates it on +every transfer that invokes end_that_request_first. It does the same for the +buffer, bio, bio->bi_idx fields too. + +The buffer field is just a virtual address mapping of the current segment +of the i/o buffer in cases where the buffer resides in low-memory. For high +memory i/o, this field is not valid and must not be used by drivers. 
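The sector count bookkeeping described above can be sketched with a small user-space model. Everything here is an assumption made for illustration: struct model_request, model_rq_init() and model_end_request_first() are hypothetical names, the segment array stands in for the bio/bio_vec chain, and the return convention (1 while sectors remain, 0 when the request is finished) is chosen for this model rather than taken from the text. It only shows how the block-internal hard_xxx counts advance across segment boundaries and how the driver-visible counts are reset from them after each partial completion.

```c
#include <assert.h>

#define MODEL_MAX_SEGS 8

/* Hypothetical user-space model; field names mirror struct request above. */
struct model_request {
	unsigned int seg_sectors[MODEL_MAX_SEGS]; /* sectors per segment */
	unsigned int nr_segs;
	unsigned int cur_seg;		/* index of the current segment */

	unsigned long nr_sectors;	/* driver-visible: sectors left */
	unsigned int current_nr_sectors;/* driver-visible: left in segment */
	unsigned long hard_nr_sectors;	/* block-internal copy of nr_sectors */
	unsigned int hard_cur_sectors;	/* block-internal copy of the above */
};

static void model_rq_init(struct model_request *rq,
			  const unsigned int *segs, unsigned int nr_segs)
{
	unsigned int i;

	if (nr_segs > MODEL_MAX_SEGS)
		nr_segs = MODEL_MAX_SEGS;

	rq->nr_segs = nr_segs;
	rq->cur_seg = 0;
	rq->hard_nr_sectors = 0;
	for (i = 0; i < nr_segs; i++) {
		rq->seg_sectors[i] = segs[i];
		rq->hard_nr_sectors += segs[i];
	}
	rq->hard_cur_sectors = nr_segs ? segs[0] : 0;
	rq->nr_sectors = rq->hard_nr_sectors;
	rq->current_nr_sectors = rq->hard_cur_sectors;
}

/*
 * Model of completing 'nsect' sectors of the request. Returns 1 while
 * sectors remain to be transferred and 0 once the request is done.
 */
static int model_end_request_first(struct model_request *rq,
				   unsigned int nsect)
{
	while (nsect && rq->hard_nr_sectors) {
		unsigned int n = nsect < rq->hard_cur_sectors ?
				 nsect : rq->hard_cur_sectors;

		rq->hard_cur_sectors -= n;
		rq->hard_nr_sectors -= n;
		nsect -= n;

		/* Segment boundary crossed: move on to the next segment. */
		if (rq->hard_cur_sectors == 0 && rq->hard_nr_sectors) {
			rq->cur_seg++;
			rq->hard_cur_sectors = rq->seg_sectors[rq->cur_seg];
		}
	}

	/* Driver-visible counts are reset from the block-internal copies. */
	rq->nr_sectors = rq->hard_nr_sectors;
	rq->current_nr_sectors = rq->hard_cur_sectors;

	return rq->hard_nr_sectors != 0;
}
```

For a request made of segments of 8, 4 and 4 sectors, completing 4 sectors leaves nr_sectors at 12 and current_nr_sectors at 4; completing 6 more crosses into the second segment and leaves 6 and 2 respectively.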
+ +Code that sets up its own request structures and passes them down to +a driver needs to be careful about interoperation with the block layer helper +functions which the driver uses. (Section 1.3) + +3. Using bios + +3.1 Setup/Teardown + +There are routines for managing the allocation, reference counting, and +freeing of bios (bio_alloc, bio_get, bio_put). + +This makes use of Ingo Molnar's mempool implementation, which enables +subsystems like bio to maintain their own reserve memory pools for guaranteed +deadlock-free allocations during extreme VM load. For example, the VM +subsystem makes use of the block layer to write out dirty pages in order to be +able to free up memory space, a case which needs careful handling. The +allocation logic draws from the preallocated emergency reserve in situations +where it cannot allocate through normal means. If the pool is empty and it +can wait, then it would trigger action that would help free up memory or +replenish the pool (without deadlocking) and wait for availability in the pool. +If it is in IRQ context, and hence not in a position to do this, allocation +could fail if the pool is empty. In general mempool always first tries to +perform allocation without having to wait, even if it means digging into the +pool as long as it is not less than 50% full. + +On a free, memory is released to the pool or directly freed depending on +the current availability in the pool. The mempool interface lets the +subsystem specify the routines to be used for normal alloc and free. In the +case of bio, these routines make use of the standard slab allocator. + +The caller of bio_alloc is expected to take certain steps to avoid +deadlocks, e.g. avoid trying to allocate more memory from the pool while +already holding memory obtained from the pool.
+[TBD: This is a potential issue, though a rare possibility + in the bounce bio allocation that happens in the current code, since + it ends up allocating a second bio from the same pool while + holding the original bio] + +Memory allocated from the pool should be released back within a limited +amount of time (in the case of bio, that would be after the i/o is completed). +This ensures that if part of the pool has been used up, some work (in this +case i/o) must already be in progress and memory would be available when it +is over. If allocating from multiple pools in the same code path, the order +or hierarchy of allocation needs to be consistent, just the way one deals +with multiple locks. + +The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc()) +for a non-clone bio. There are 6 pools set up for different size biovecs, +so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the +given size from these slabs. + +The bi_destructor() routine takes into account the possibility of the bio +having originated from a different source (see later discussions on +n/w to block transfers and kvec_cb) + +The bio_get() routine may be used to hold an extra reference on a bio prior +to i/o submission, if the bio fields are likely to be accessed after the +i/o is issued (since the bio may otherwise get freed in case i/o completion +happens in the meantime). + +The bio_clone() routine may be used to duplicate a bio, where the clone +shares the bio_vec_list with the original bio (i.e. both point to the +same bio_vec_list). This would typically be used for splitting i/o requests +in lvm or md. + +3.2 Generic bio helper Routines + +3.2.1 Traversing segments and completion units in a request + +The macros bio_for_each_segment() and rq_for_each_bio() should be used for +traversing the bios in the request list (drivers should avoid directly +trying to do it themselves).
Using these helpers should also make it easier +to cope with block changes in the future. + + rq_for_each_bio(bio, rq) + bio_for_each_segment(bio_vec, bio, i) + /* bio_vec is now current segment */ + +I/O completion callbacks are per-bio rather than per-segment, so drivers +that traverse bio chains on completion need to keep that in mind. Drivers +which don't make a distinction between segments and completion units would +need to be reorganized to support multi-segment bios. + +3.2.2 Setting up DMA scatterlists + +The blk_rq_map_sg() helper routine would be used for setting up scatter +gather lists from a request, so a driver need not do it on its own. + + nr_segments = blk_rq_map_sg(q, rq, scatterlist); + +The helper routine provides a level of abstraction which makes it easier +to modify the internals of request to scatterlist conversion down the line +without breaking drivers. The blk_rq_map_sg routine takes care of several +things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER +is set) and correct segment accounting to avoid exceeding the limits which +the i/o hardware can handle, based on various queue properties. + +- Prevents a clustered segment from crossing a 4GB mem boundary +- Avoids building segments that would exceed the number of physical + memory segments that the driver can handle (phys_segments) and the + number that the underlying hardware can handle at once, accounting for + DMA remapping (hw_segments) (i.e. IOMMU aware limits). + +Routines which the low level driver can use to set up the segment limits: + +blk_queue_max_hw_segments() : Sets an upper limit of the maximum number of +hw data segments in a request (i.e. the maximum number of address/length +pairs the host adapter can actually hand to the device at once) + +blk_queue_max_phys_segments() : Sets an upper limit on the maximum number +of physical data segments in a request (i.e. 
the largest sized scatter list +a driver could handle) + +3.2.3 I/O completion + +The existing generic block layer helper routines end_request, +end_that_request_first and end_that_request_last can be used for i/o +completion (and setting things up so the rest of the i/o or the next +request can be kicked off) as before. With the introduction of multi-page +bio support, end_that_request_first requires an additional argument indicating +the number of sectors completed. + +3.2.4 Implications for drivers that do not interpret bios (don't handle + multiple segments) + +Drivers that do not interpret bios, e.g. those which do not handle multiple +segments and do not support i/o into high memory addresses (require bounce +buffers) and expect only virtually mapped buffers, can access the rq->buffer +field. As before the driver should use current_nr_sectors to determine the +size of remaining data in the current segment (that is the maximum it can +transfer in one go unless it interprets segments), and rely on the block layer +end_request, or end_that_request_first/last to take care of all accounting +and transparent mapping of the next bio segment when a segment boundary +is crossed on completion of a transfer. (The end*request* functions should +be used if and only if the request has come down from block/bio path, not for +direct access requests which only specify rq->buffer without a valid rq->bio) + +3.2.5 Generic request command tagging + +3.2.5.1 Tag helpers + +Block now offers some simple generic functionality to help support command +queueing (typically known as tagged command queueing), i.e. manage more than +one outstanding command on a queue at any given time. + + blk_queue_init_tags(request_queue_t *q, int depth) + + Initialize internal command tagging structures for a maximum + depth of 'depth'. + + blk_queue_free_tags(request_queue_t *q) + + Teardown tag info associated with the queue.
This will be done + automatically by block if blk_queue_cleanup() is called on a queue + that is using tagging. + +The above are initialization and exit management, the main helpers during +normal operations are: + + blk_queue_start_tag(request_queue_t *q, struct request *rq) + + Start tagged operation for this request. A free tag number between + 0 and 'depth' is assigned to the request (rq->tag holds this number), + and 'rq' is added to the internal tag management. If the maximum depth + for this queue is already achieved (or if the tag wasn't started for + some other reason), 1 is returned. Otherwise 0 is returned. + + blk_queue_end_tag(request_queue_t *q, struct request *rq) + + End tagged operation on this request. 'rq' is removed from the internal + book keeping structures. + +To minimize struct request and queue overhead, the tag helpers utilize some +of the same request members that are used for normal request queue management. +This means that a request cannot both be an active tag and be on the queue +list at the same time. blk_queue_start_tag() will remove the request, but +the driver must remember to call blk_queue_end_tag() before signalling +completion of the request to the block layer. This means ending tag +operations before calling end_that_request_last()! For an example of a user +of these helpers, see the IDE tagged command queueing support. + +Certain hardware conditions may dictate a need to invalidate the block tag +queue. For instance, on IDE any tagged request error needs to clear both +the hardware and software block queue and enable the driver to sanely restart +all the outstanding requests. There's a third helper to do that: + + blk_queue_invalidate_tags(request_queue_t *q) + + Clear the internal block tag queue and readd all the pending requests + to the request queue. The driver will receive them again on the + next request_fn run, just like it did the first time it encountered + them. 
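The start/end tag behaviour described above can be sketched as a user-space model. This is an illustration under stated assumptions, not the block layer's implementation: struct model_tags, struct model_tagged_rq and the model_* functions are hypothetical names, the bitmap is a single unsigned long (so the model supports at most 32 tags), tags are handed out from 0 to depth - 1, and the busy list used for restart ordering is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_DEPTH 32

/* Hypothetical user-space stand-ins; not the kernel's structures. */
struct model_tagged_rq {
	int tag;			/* -1 while the request is untagged */
};

struct model_tags {
	struct model_tagged_rq *tag_index[MODEL_MAX_DEPTH]; /* tag -> rq */
	unsigned long tag_map;		/* bit n set: tag n is in use */
	int busy;			/* number of tags in use */
	int max_depth;			/* maximum number of tags */
};

/* Returns 0 on success, -1 if 'depth' is outside what the model supports. */
static int model_init_tags(struct model_tags *t, int depth)
{
	int i;

	if (depth < 1 || depth > MODEL_MAX_DEPTH)
		return -1;
	for (i = 0; i < MODEL_MAX_DEPTH; i++)
		t->tag_index[i] = NULL;
	t->tag_map = 0;
	t->busy = 0;
	t->max_depth = depth;
	return 0;
}

/* Returns 0 and sets rq->tag on success, 1 if no tag is free. */
static int model_start_tag(struct model_tags *t, struct model_tagged_rq *rq)
{
	int tag;

	for (tag = 0; tag < t->max_depth; tag++) {
		if (!(t->tag_map & (1UL << tag))) {
			t->tag_map |= 1UL << tag;
			t->tag_index[tag] = rq;
			t->busy++;
			rq->tag = tag;
			return 0;
		}
	}
	return 1;	/* maximum depth already reached */
}

/* Releases the tag held by 'rq'; does nothing if 'rq' is not tagged. */
static void model_end_tag(struct model_tags *t, struct model_tagged_rq *rq)
{
	if (rq->tag < 0 || rq->tag >= t->max_depth)
		return;
	if (t->tag_index[rq->tag] != rq)
		return;
	t->tag_map &= ~(1UL << rq->tag);
	t->tag_index[rq->tag] = NULL;
	t->busy--;
	rq->tag = -1;
}
```

With a depth of 2, a third request cannot be started until one of the first two has ended its tag, at which point the freed tag number is reused.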
+ +3.2.5.2 Tag info + +Some block functions exist to query current tag status or to go from a +tag number to the associated request. These are, in no particular order: + + blk_queue_tagged(q) + + Returns 1 if the queue 'q' is using tagging, 0 if not. + + blk_queue_tag_request(q, tag) + + Returns a pointer to the request associated with tag 'tag'. + + blk_queue_tag_depth(q) + + Returns the current queue depth. + + blk_queue_tag_queue(q) + + Returns 1 if the queue can accept a new queued command, 0 if we are + at the maximum depth already. + + blk_queue_rq_tagged(rq) + + Returns 1 if the request 'rq' is tagged. + +3.2.5.3 Internal structure + +Internally, block manages tags in the blk_queue_tag structure: + + struct blk_queue_tag { + struct request **tag_index; /* array of pointers to rq */ + unsigned long *tag_map; /* bitmap of free tags */ + struct list_head busy_list; /* fifo list of busy tags */ + int busy; /* queue depth */ + int max_depth; /* max queue depth */ + }; + +Most of the above is simple and straightforward, however busy_list may need +a bit of explaining. Normally we don't care too much about request ordering, +but in the event of any barrier requests in the tag queue we need to ensure +that requests are restarted in the order they were queued. This may happen +if the driver needs to use blk_queue_invalidate_tags(). + +Tagging also defines a new request flag, REQ_QUEUED. This is set whenever +a request is currently tagged. You should not test this flag directly; +blk_rq_tagged(rq) is the portable way to do so. + +3.3 I/O Submission + +The routine submit_bio() is used to submit a single io. Higher level i/o +routines make use of this: + +(a) Buffered i/o: +The routine submit_bh() invokes submit_bio() on a bio corresponding to the +bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before.
+ +(b) Kiobuf i/o (for raw/direct i/o): +The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and +maps the array to one or more multi-page bios, issuing submit_bio() to +perform the i/o on each of these. + +The embedded bh array in the kiobuf structure has been removed and no +preallocation of bios is done for kiobufs. [The intent is to remove the +blocks array as well, but it's currently in there to kludge around direct i/o.] +Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc. + +Todo/Observation: + + A single kiobuf structure is assumed to correspond to a contiguous range + of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec. + So right now it wouldn't work for direct i/o on non-contiguous blocks. + This is to be resolved. The eventual direction is to replace kiobuf + by kvec's. + + Badari Pulavarty has a patch to implement direct i/o correctly using + bio and kvec. + + +(c) Page i/o: +Todo/Under discussion: + + Andrew Morton's multi-page bio patches attempt to issue multi-page + writeouts (and reads) from the page cache, by directly building up + large bios for submission completely bypassing the usage of buffer + heads. This work is still in progress. + + Christoph Hellwig had some code that uses bios for page-io (rather than + bh). This isn't included in bio as yet. Christoph was also working on a + design for representing virtual/real extents as an entity and modifying + some of the address space ops interfaces to utilize this abstraction rather + than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf + abstraction, but intended to be as lightweight as possible). + +(d) Direct access i/o: +Direct access requests that do not contain bios would be submitted differently +as discussed earlier in section 1.3. + +Aside: + + Kvec i/o: + + Ben LaHaise's aio code uses a slightly different structure instead + of kiobufs, called a kvec_cb.
This contains an array of <page, offset, len> + tuples (very much like the networking code), together with a callback function + and data pointer. This is embedded into a brw_cb structure when passed + to brw_kvec_async(). + + Now it should be possible to directly map these kvecs to a bio. Just as while + cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec + array pointer to point to the veclet array in kvecs. + + TBD: In order for this to work, some changes are needed in the way multi-page + bios are handled today. The values of the tuples in such a vector passed in + from higher level code should not be modified by the block layer in the course + of its request processing, since that would make it hard for the higher layer + to continue to use the vector descriptor (kvec) after i/o completes. Instead, + all such transient state should be maintained in the request structure + and passed on in some way to the endio completion routine. + + +4. The I/O scheduler +I/O schedulers are now per queue. They should be runtime switchable and modular +but aren't yet. Jens has most bits to do this, but the sysfs implementation is +missing. + +A block layer call to the i/o scheduler follows the convention elv_xxx(). This +calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c). Oh, +xxx and xxx might not match exactly, but use your imagination. If an elevator +doesn't implement a function, the switch does nothing or some minimal house +keeping work. + +4.1. I/O scheduler API + +The functions an elevator may implement are: (* are mandatory) +elevator_merge_fn called to query requests for merge with a bio + +elevator_merge_req_fn " " " with another request + +elevator_merged_fn called when a request in the scheduler has been + involved in a merge. It is used in the deadline + scheduler for example, to reposition the request + if its sorting order has changed.
+ +*elevator_next_req_fn returns the next scheduled request, or NULL + if there are none (or none are ready). + +*elevator_add_req_fn called to add a new request into the scheduler + +elevator_queue_empty_fn returns true if the merge queue is empty. + Drivers shouldn't use this, but rather check + if elv_next_request is NULL (without losing the + request if one exists!) + +elevator_remove_req_fn This is called when a driver claims ownership of + the target request - it now belongs to the + driver. It must not be modified or merged. + Drivers must not lose the request! A subsequent + call of elevator_next_req_fn must return the + _next_ request. + +elevator_requeue_req_fn called to add a request to the scheduler. This + is used when the request has already been + returned by elv_next_request, but hasn't + completed. If this is not implemented then + elevator_add_req_fn is called instead. + +elevator_former_req_fn +elevator_latter_req_fn These return the request before or after the + one specified in disk sort order. Used by the + block layer to find merge possibilities. + +elevator_completed_req_fn called when a request is completed. This might + come about due to being merged with another or + when the device completes the request. + +elevator_may_queue_fn returns true if the scheduler wants to allow the + current context to queue a new request even if + it is over the queue limit. This must be used + very carefully!! + +elevator_set_req_fn +elevator_put_req_fn Must be used to allocate and free any elevator + specific storage for a request. + +elevator_init_fn +elevator_exit_fn Allocate and free any elevator specific storage + for a queue. + +4.2 I/O scheduler implementation +The generic i/o scheduler algorithm attempts to sort/merge/batch requests for +optimal disk scan and request servicing performance (based on generic +principles and device capabilities), optimized for: +i. improved throughput +ii. improved latency +iii.
better utilization of h/w & CPU time + +Characteristics: + +i. Binary tree +AS and deadline i/o schedulers use red-black binary trees for disk position +sorting and searching, and a fifo linked list for time-based searching. This +gives good scalability and good availability of information. Requests are +almost always dispatched in disk sort order, so a cache is kept of the next +request in sort order to prevent binary tree lookups. + +This arrangement is not a generic block layer characteristic however, so +elevators may implement queues as they please. + +ii. Last merge hint +The last merge hint is part of the generic queue layer. I/O schedulers must do +some management on it. For the most part, the most important thing is to make +sure q->last_merge is cleared (set to NULL) when the request on it is no longer +a candidate for merging (for example if it has been sent to the driver). + +The last merge performed is cached as a hint for the subsequent request. If +sequential data is being submitted, the hint is used to perform merges without +any scanning. This is not sufficient when there are multiple processes doing +I/O though, so a "merge hash" is used by some schedulers. + +iii. Merge hash +AS and deadline use a hash table indexed by the last sector of a request. This +enables merging code to quickly look up "back merge" candidates, even when +multiple I/O streams are being performed at once on one disk. + +"Front merges", a new request being merged at the front of an existing request, +are far less common than "back merges" due to the nature of most I/O patterns. +Front merges are handled by the binary trees in AS and deadline schedulers. + +iv. Handling barrier cases +A request with flags REQ_HARDBARRIER or REQ_SOFTBARRIER must not be ordered +around. That is, they must be processed after all older requests, and before +any newer ones. This includes merges! + +In AS and deadline schedulers, barriers have the effect of flushing the reorder +queue.
The performance cost of this will vary from nothing to a lot depending +on i/o patterns and device characteristics. Obviously they won't improve +performance, so their use should be kept to a minimum. + +v. Handling insertion position directives +A request may be inserted with a position directive. The directives are one of +ELEVATOR_INSERT_BACK, ELEVATOR_INSERT_FRONT, ELEVATOR_INSERT_SORT. + +ELEVATOR_INSERT_SORT is a general directive for non-barrier requests. +ELEVATOR_INSERT_BACK is used to insert a barrier to the back of the queue. +ELEVATOR_INSERT_FRONT is used to insert a barrier to the front of the queue, and +overrides the ordering requested by any previous barriers. In practice this is +harmless and required, because it is used for SCSI requeueing. This does not +require flushing the reorder queue, so does not impose a performance penalty. + +vi. Plugging the queue to batch requests in anticipation of opportunities for + merge/sort optimizations + +This is just the same as in 2.4 so far, though per-device unplugging +support is anticipated for 2.5. Also with a priority-based i/o scheduler, +such decisions could be based on request priorities. + +Plugging is an approach that the current i/o scheduling algorithm resorts to so +that it collects up enough requests in the queue to be able to take +advantage of the sorting/merging logic in the elevator. If the +queue is empty when a request comes in, then it plugs the request queue +(sort of like plugging the bottom of a vessel to get fluid to build up) +till it fills up with a few more requests, before starting to service +the requests. This provides an opportunity to merge/sort the requests before +passing them down to the device. There are various conditions when the queue is +unplugged (to open up the flow again), either through a scheduled task or +could be on demand. For example wait_on_buffer sets the unplugging going +(by running tq_disk) so the read gets satisfied soon. 
So in the read case, +the queue gets explicitly unplugged as part of waiting for completion, +in fact all queues get unplugged as a side-effect. + +Aside: + This is kind of controversial territory, as it's not clear if plugging is + always the right thing to do. Devices typically have their own queues, + and allowing a big queue to build up in software, while letting the device be + idle for a while may not always make sense. The trick is to handle the fine + balance between when to plug and when to open up. Also now that we have + multi-page bios being queued in one shot, we may not need to wait to merge + a big request from the broken up pieces coming by. + + Per-queue granularity unplugging (still a Todo) may help reduce some of the + concerns with just a single tq_disk flush approach. Something like + blk_kick_queue() to unplug a specific queue (right away ?) + or optionally, all queues, is in the plan. + +4.3 I/O contexts +I/O contexts provide a dynamically allocated per process data area. They may +be used in I/O schedulers, and in the block layer (could be used for IO statistics, +priorities for example). See *io_context in drivers/block/ll_rw_blk.c, and +as-iosched.c for an example of usage in an i/o scheduler. + + +5. Scalability related changes + +5.1 Granular Locking: io_request_lock replaced by a per-queue lock + +The global io_request_lock has been removed as of 2.5, to avoid +the scalability bottleneck it was causing, and has been replaced by more +granular locking. The request queue structure has a pointer to the +lock to be used for that queue. As a result, locking can now be +per-queue, with a provision for sharing a lock across queues if +necessary (e.g. the scsi layer sets the queue lock pointers to the +corresponding adapter lock, which results in a per host locking +granularity). The locking semantics are the same, i.e.
locking is +still imposed by the block layer, grabbing the lock before +request_fn execution which means that lots of older drivers +should still be SMP safe. Drivers are free to drop the queue +lock themselves, if required. Drivers that explicitly used the +io_request_lock for serialization need to be modified accordingly. +Usually it's as easy as adding a global lock: + + static spinlock_t my_driver_lock = SPIN_LOCK_UNLOCKED; + +and passing the address of that lock to blk_init_queue(). + +5.2 64 bit sector numbers (sector_t prepares for 64 bit support) + +The sector number used in the bio structure has been changed to sector_t, +which could be defined as 64 bit in preparation for 64 bit sector support. + +6. Other Changes/Implications + +6.1 Partition re-mapping handled by the generic block layer + +In 2.5 some of the gendisk/partition related code has been reorganized. +Now the generic block layer performs partition-remapping early and thus +provides drivers with a sector number relative to whole device, rather than +having to take partition number into account in order to arrive at the true +sector number. The routine blk_partition_remap() is invoked by +generic_make_request even before invoking the queue specific make_request_fn, +so the i/o scheduler also gets to operate on whole disk sector numbers. This +should typically not require changes to block drivers, it just never gets +to invoke its own partition sector offset calculations since all bios +sent are offset from the beginning of the device. + + +7. A Few Tips on Migration of older drivers + +Old-style drivers that just use CURRENT and ignore clustered requests, +may not need much change. The generic layer will automatically handle +clustered requests, multi-page bios, etc for the driver. + +For a low performance driver or hardware that is PIO driven or just doesn't +support scatter-gather, changes should be minimal too.
+ +The following are some points to keep in mind when converting old drivers +to bio. + +Drivers should use elv_next_request to pick up requests and are no longer +supposed to handle looping directly over the request list. +(struct request->queue has been removed) + +Now end_that_request_first takes an additional number_of_sectors argument. +It used to handle always just the first buffer_head in a request, now +it will loop and handle as many sectors (on a bio-segment granularity) +as specified. + +Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the +right thing to use is bio_endio(bio, uptodate) instead. + +If the driver is dropping the io_request_lock from its request_fn strategy, +then it just needs to replace that with q->queue_lock instead. + +As described in Sec 1.1, drivers can set max sector size, max segment size +etc per queue now. Drivers that used to define their own merge functions +to handle things like this can now just use the blk_queue_* functions at +blk_init_queue time. + +Drivers no longer have to map a {partition, sector offset} into the +correct absolute location anymore, this is done by the block layer, so +where a driver received a request ala this before: + + rq->rq_dev = mk_kdev(3, 5); /* /dev/hda5 */ + rq->sector = 0; /* first sector on hda5 */ + + it will now see + + rq->rq_dev = mk_kdev(3, 0); /* /dev/hda */ + rq->sector = 123128; /* offset from start of disk */ + +As mentioned, there is no virtual mapping of a bio. For DMA, this is +not a problem as the driver probably never will need a virtual mapping. +Instead it needs a bus mapping (pci_map_page for a single segment or +use blk_rq_map_sg for scatter gather) to be able to ship it to the driver. For +PIO drivers (or drivers that need to revert to PIO transfer once in a +while (IDE for example)), where the CPU is doing the actual data +transfer a virtual mapping is needed.
If the driver supports highmem I/O, +(Sec 1.1, (ii)) it needs to use __bio_kmap_atomic and bio_kmap_irq to +temporarily map a bio into the virtual address space. + + +8. Prior/Related/Impacted patches + +8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp) +- orig kiobuf & raw i/o patches (now in 2.4 tree) +- direct kiobuf based i/o to devices (no intermediate bh's) +- page i/o using kiobuf +- kiobuf splitting for lvm (mkp) +- elevator support for kiobuf request merging (axboe) +8.2. Zero-copy networking (Dave Miller) +8.3. SGI XFS - pagebuf patches - use of kiobufs +8.4. Multi-page pioent patch for bio (Christoph Hellwig) +8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11 +8.6. Async i/o implementation patch (Ben LaHaise) +8.7. EVMS layering design (IBM EVMS team) +8.8. Larger page cache size patch (Ben LaHaise) and + Large page size (Daniel Phillips) + => larger contiguous physical memory buffers +8.9. VM reservations patch (Ben LaHaise) +8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?) +8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+ +8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar, + Badari) +8.13. Priority based i/o scheduler - prepatches (Arjan van de Ven) +8.14. IDE Taskfile i/o patch (Andre Hedrick) +8.15. Multi-page writeout and readahead patches (Andrew Morton) +8.16. Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarthy) + +9. Other References: + +9.1 The Splice I/O Model - Larry McVoy (and subsequent discussions on lkml, +and Linus' comments - Jan 2001) +9.2 Discussions about kiobuf and bh design on lkml between sct, linus, alan +et al - Feb-March 2001 (many of the initial thoughts that led to bio were +brought up in this discussion thread) +9.3 Discussions on mempool on lkml - Dec 2001.
+ diff --git a/Documentation/block/deadline-iosched.txt b/Documentation/block/deadline-iosched.txt new file mode 100644 index 000000000000..c918b3a6022d --- /dev/null +++ b/Documentation/block/deadline-iosched.txt @@ -0,0 +1,78 @@ +Deadline IO scheduler tunables +============================== + +This little file attempts to document how the deadline io scheduler works. +In particular, it will clarify the meaning of the exposed tunables that may be +of interest to power users. + +Each io queue has a set of io scheduler tunables associated with it. These +tunables control how the io scheduler works. You can find these entries +in: + +/sys/block//queue/iosched + +assuming that you have sysfs mounted on /sys. If you don't have sysfs mounted, +you can do so by typing: + +# mount none /sys -t sysfs + + +******************************************************************************** + + +read_expire (in ms) +----------- + +The goal of the deadline io scheduler is to attempt to guarantee a start +service time for a request. As we focus mainly on read latencies, this is +tunable. When a read request first enters the io scheduler, it is assigned +a deadline that is the current time + the read_expire value in units of +milliseconds. + + +write_expire (in ms) +----------- + +Similar to read_expire mentioned above, but for writes. + + +fifo_batch +---------- + +When a read request expires its deadline, we must move some requests from +the sorted io scheduler list to the block device dispatch queue. fifo_batch +controls how many requests we move, based on the cost of each request. A +request is either qualified as a seek or a stream. The io scheduler knows +the last request that was serviced by the drive (or will be serviced right +before this one). See seek_cost and stream_unit. + + +writes_starved (number of dispatches) +-------------- + +When we have to move requests from the io scheduler queue to the block +device dispatch queue, we always give a preference to reads.
However, we +don't want to starve writes indefinitely either. So writes_starved controls +how many times we give preference to reads over writes. When that has been +done writes_starved number of times, we dispatch some writes based on the +same criteria as reads. + + +front_merges (bool) +------------ + +Sometimes it happens that a request enters the io scheduler that is contiguous +with a request that is already on the queue. Either it fits in the back of that +request, or it fits at the front. That is called either a back merge candidate +or a front merge candidate. Due to the way files are typically laid out, +back merges are much more common than front merges. For some work loads, you +may even know that it is a waste of time to spend any time attempting to +front merge requests. Setting front_merges to 0 disables this functionality. +Front merges may still occur due to the cached last_merge hint, but since +that comes at basically 0 cost we leave that on. We simply disable the +rbtree front sector lookup when the io scheduler merge function is called. + + +Nov 11 2002, Jens Axboe + + diff --git a/Documentation/block/request.txt b/Documentation/block/request.txt new file mode 100644 index 000000000000..75924e2a6975 --- /dev/null +++ b/Documentation/block/request.txt @@ -0,0 +1,88 @@ + +struct request documentation + +Jens Axboe 27/05/02 + +1.0 +Index + +2.0 Struct request members classification + + 2.1 struct request members explanation + +3.0 + + +2.0 +Short explanation of request members + +Classification flags: + + D driver member + B block layer member + I I/O scheduler member + +Unless an entry contains a D classification, a device driver must not access +this member. Some members may contain D classifications, but should only be +accessed through certain macros or functions (eg ->flags).
+ + + +2.1 +Member Flag Comment +------ ---- ------- + +struct list_head queuelist BI Organization on various internal + queues + +void *elevator_private I I/O scheduler private data + +unsigned char cmd[16] D Driver can use this for setting up + a cdb before execution, see + blk_queue_prep_rq + +unsigned long flags DBI Contains info about data direction, + request type, etc. + +int rq_status D Request status bits + +kdev_t rq_dev DBI Target device + +int errors DB Error counts + +sector_t sector DBI Target location + +sector_t hard_sector B Used to keep sector sane + +unsigned long nr_sectors DBI Total number of sectors in request + +unsigned long hard_nr_sectors B Used to keep nr_sectors sane + +unsigned short nr_phys_segments DB Number of physical scatter gather + segments in a request + +unsigned short nr_hw_segments DB Number of hardware scatter gather + segments in a request + +unsigned int current_nr_sectors DB Number of sectors in first segment + of request + +unsigned int hard_cur_sectors B Used to keep current_nr_sectors sane + +int tag DB TCQ tag, if assigned + +void *special D Free to be used by driver + +char *buffer D Map of first segment, also see + section on bouncing SECTION + +struct completion *waiting D Can be used by driver to get signalled + on request completion + +struct bio *bio DBI First bio in request + +struct bio *biotail DBI Last bio in request + +request_queue_t *q DB Request queue this request belongs to + +struct request_list *rl B Request list this request came from diff --git a/Documentation/cachetlb.txt b/Documentation/cachetlb.txt new file mode 100644 index 000000000000..e132fb1163b0 --- /dev/null +++ b/Documentation/cachetlb.txt @@ -0,0 +1,384 @@ + Cache and TLB Flushing + Under Linux + + David S. Miller + +This document describes the cache/tlb flushing interfaces called +by the Linux VM subsystem.
It enumerates each interface, +describes its intended purpose, and what side effect is expected +after the interface is invoked. + +The side effects described below are stated for a uniprocessor +implementation, and what is to happen on that single processor. The +SMP cases are a simple extension, in that you just extend the +definition such that the side effect for a particular interface occurs +on all processors in the system. Don't let this scare you into +thinking SMP cache/tlb flushing must be so inefficient, this is in +fact an area where many optimizations are possible. For example, +if it can be proven that a user address space has never executed +on a cpu (see vma->cpu_vm_mask), one need not perform a flush +for this address space on that cpu. + +First, the TLB flushing interfaces, since they are the simplest. The +"TLB" is abstracted under Linux as something the cpu uses to cache +virtual-->physical address translations obtained from the software +page tables. Meaning that if the software page tables change, it is +possible for stale translations to exist in this "TLB" cache. +Therefore when software page table changes occur, the kernel will +invoke one of the following flush methods _after_ the page table +changes occur: + +1) void flush_tlb_all(void) + + The most severe flush of all. After this interface runs, + any previous page table modification whatsoever will be + visible to the cpu. + + This is usually invoked when the kernel page tables are + changed, since such translations are "global" in nature. + +2) void flush_tlb_mm(struct mm_struct *mm) + + This interface flushes an entire user address space from + the TLB. After running, this interface must make sure that + any previous page table modifications for the address space + 'mm' will be visible to the cpu. That is, after running, + there will be no entries in the TLB for 'mm'.
+ + This interface is used to handle whole address space + page table operations such as what happens during + fork and exec. + + Platform developers note that generic code will always + invoke this interface without mm->page_table_lock held. + +3) void flush_tlb_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end) + + Here we are flushing a specific range of (user) virtual + address translations from the TLB. After running, this + interface must make sure that any previous page table + modifications for the address space 'vma->vm_mm' in the range + 'start' to 'end-1' will be visible to the cpu. That is, after + running, there will be no entries in the TLB for 'mm' for + virtual addresses in the range 'start' to 'end-1'. + + The "vma" is the backing store being used for the region. + Primarily, this is used for munmap() type operations. + + The interface is provided in hopes that the port can find + a suitably efficient method for removing multiple page + sized translations from the TLB, instead of having the kernel + call flush_tlb_page (see below) for each entry which may be + modified. + + Platform developers note that generic code will always + invoke this interface with mm->page_table_lock held. + +4) void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr) + + This time we need to remove the PAGE_SIZE sized translation + from the TLB. The 'vma' is the backing structure used by + Linux to keep track of mmap'd regions for a process, the + address space is available via vma->vm_mm. Also, one may + test (vma->vm_flags & VM_EXEC) to see if this region is + executable (and thus could be in the 'instruction TLB' in + split-tlb type setups). + + After running, this interface must make sure that any previous + page table modification for address space 'vma->vm_mm' for + user virtual address 'addr' will be visible to the cpu. That + is, after running, there will be no entries in the TLB for + 'vma->vm_mm' for virtual address 'addr'.
+ + This is used primarily during fault processing. + + Platform developers note that generic code will always + invoke this interface with mm->page_table_lock held. + +5) void flush_tlb_pgtables(struct mm_struct *mm, + unsigned long start, unsigned long end) + + The software page tables for address space 'mm' for virtual + addresses in the range 'start' to 'end-1' are being torn down. + + Some platforms cache the lowest level of the software page tables + in a linear virtually mapped array, to make TLB miss processing + more efficient. On such platforms, since the TLB is caching the + software page table structure, it needs to be flushed when parts + of the software page table tree are unlinked/freed. + + Sparc64 is one example of a platform which does this. + + Usually, when munmap()'ing an area of user virtual address + space, the kernel leaves the page table parts around and just + marks the individual pte's as invalid. However, if very large + portions of the address space are unmapped, the kernel frees up + those portions of the software page tables to prevent potential + excessive kernel memory usage caused by erratic mmap/munmap + sequences. It is at these times that flush_tlb_pgtables will + be invoked. + +6) void update_mmu_cache(struct vm_area_struct *vma, + unsigned long address, pte_t pte) + + At the end of every page fault, this routine is invoked to + tell the architecture specific code that a translation + described by "pte" now exists at virtual address "address" + for address space "vma->vm_mm", in the software page tables. + + A port may use this information in any way it so chooses. + For example, it could use this event to pre-load TLB + translations for software managed TLB configurations. + The sparc64 port currently does this. + +7) void tlb_migrate_finish(struct mm_struct *mm) + + This interface is called at the end of an explicit + process migration.
This interface provides a hook + to allow a platform to update TLB or context-specific + information for the address space. + + The ia64 sn2 platform is one example of a platform + that uses this interface. + +8) void lazy_mmu_prot_update(pte_t pte) + This interface is called whenever the protection on + any user PTE changes. This interface provides a notification + to architecture specific code to take appropriate action. + + +Next, we have the cache flushing interfaces. In general, when Linux +is changing an existing virtual-->physical mapping to a new value, +the sequence will be in one of the following forms: + + 1) flush_cache_mm(mm); + change_all_page_tables_of(mm); + flush_tlb_mm(mm); + + 2) flush_cache_range(vma, start, end); + change_range_of_page_tables(mm, start, end); + flush_tlb_range(vma, start, end); + + 3) flush_cache_page(vma, addr, pfn); + set_pte(pte_pointer, new_pte_val); + flush_tlb_page(vma, addr); + +The cache level flush will always be first, because this allows +us to properly handle systems whose caches are strict and require +a virtual-->physical translation to exist for a virtual address +when that virtual address is flushed from the cache. The HyperSparc +cpu is one such cpu with this attribute. + +The cache flushing routines below need only deal with cache flushing +to the extent that it is necessary for a particular cpu. Mostly, +these routines must be implemented for cpus which have virtually +indexed caches which must be flushed when virtual-->physical +translations are changed or removed. So, for example, the physically +indexed physically tagged caches of IA32 processors have no need to +implement these interfaces since the caches are fully synchronized +and have no dependency on translation information. + +Here are the routines, one by one: + +1) void flush_cache_mm(struct mm_struct *mm) + + This interface flushes an entire user address space from + the caches.
That is, after running, there will be no cache + lines associated with 'mm'. + + This interface is used to handle whole address space + page table operations such as what happens during + fork, exit, and exec. + +2) void flush_cache_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end) + + Here we are flushing a specific range of (user) virtual + addresses from the cache. After running, there will be no + entries in the cache for 'vma->vm_mm' for virtual addresses in + the range 'start' to 'end-1'. + + The "vma" is the backing store being used for the region. + Primarily, this is used for munmap() type operations. + + The interface is provided in hopes that the port can find + a suitably efficient method for removing multiple page + sized regions from the cache, instead of having the kernel + call flush_cache_page (see below) for each entry which may be + modified. + +3) void flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) + + This time we need to remove a PAGE_SIZE sized range + from the cache. The 'vma' is the backing structure used by + Linux to keep track of mmap'd regions for a process, the + address space is available via vma->vm_mm. Also, one may + test (vma->vm_flags & VM_EXEC) to see if this region is + executable (and thus could be in the 'instruction cache' in + "Harvard" type cache layouts). + + The 'pfn' indicates the physical page frame (shift this value + left by PAGE_SHIFT to get the physical address) that 'addr' + translates to. It is this mapping which should be removed from + the cache. + + After running, there will be no entries in the cache for + 'vma->vm_mm' for virtual address 'addr' which translates + to 'pfn'. + + This is used primarily during fault processing. + +4) void flush_cache_kmaps(void) + + This routine need only be implemented if the platform utilizes + highmem. It will be called right before all of the kmaps + are invalidated. 
+ + After running, there will be no entries in the cache for + the kernel virtual address range PKMAP_ADDR(0) to + PKMAP_ADDR(LAST_PKMAP). + + This routine should be implemented in asm/highmem.h + +5) void flush_cache_vmap(unsigned long start, unsigned long end) + void flush_cache_vunmap(unsigned long start, unsigned long end) + + Here in these two interfaces we are flushing a specific range + of (kernel) virtual addresses from the cache. After running, + there will be no entries in the cache for the kernel address + space for virtual addresses in the range 'start' to 'end-1'. + + The first of these two routines is invoked after map_vm_area() + has installed the page table entries. The second is invoked + before unmap_vm_area() deletes the page table entries. + +There exists another whole class of cpu cache issues which currently +require a whole different set of interfaces to handle properly. +The biggest problem is that of virtual aliasing in the data cache +of a processor. + +Is your port susceptible to virtual aliasing in its D-cache? +Well, if your D-cache is virtually indexed, is larger in size than +PAGE_SIZE, and does not prevent multiple cache lines for the same +physical address from existing at once, you have this problem. + +If your D-cache has this problem, first define asm/shmparam.h SHMLBA +properly, it should essentially be the size of your virtually +addressed D-cache (or if the size is variable, the largest possible +size). This setting will force the SysV IPC layer to only allow user +processes to mmap shared memory at addresses which are a multiple of +this value. + +NOTE: This does not fix shared mmaps, check out the sparc64 port for +one way to solve this (in particular SPARC_FLAG_MMAPSHARED). + +Next, you have to solve the D-cache aliasing issue for all +other cases.
Please keep in mind the fact that, for a given page +mapped into some user address space, there is always at least one more +mapping, that of the kernel in its linear mapping starting at +PAGE_OFFSET. So immediately, once the first user maps a given +physical page into its address space, by implication the D-cache +aliasing problem has the potential to exist since the kernel already +maps this page at its virtual address. + + void copy_user_page(void *to, void *from, unsigned long addr, struct page *page) + void clear_user_page(void *to, unsigned long addr, struct page *page) + + These two routines store data in user anonymous or COW + pages. They allow a port to efficiently avoid D-cache alias + issues between userspace and the kernel. + + For example, a port may temporarily map 'from' and 'to' to + kernel virtual addresses during the copy. The virtual address + for these two pages is chosen in such a way that the kernel + load/store instructions happen to virtual addresses which are + of the same "color" as the user mapping of the page. Sparc64, + for example, uses this technique. + + The 'addr' parameter tells the virtual address where the + user will ultimately have this page mapped, and the 'page' + parameter gives a pointer to the struct page of the target. + + If D-cache aliasing is not an issue, these two routines may + simply call memcpy/memset directly and do nothing more. + + void flush_dcache_page(struct page *page) + + Any time the kernel writes to a page cache page, _OR_ + the kernel is about to read from a page cache page and + user space shared/writable mappings of this page potentially + exist, this routine is called. + + NOTE: This routine need only be called for page cache pages + which can potentially ever be mapped into the address + space of a user process. So for example, VFS layer code + handling vfs symlinks in the page cache need not call + this interface at all.
+ + The phrase "kernel writes to a page cache page" means, + specifically, that the kernel executes store instructions + that dirty data in that page at the page->virtual mapping + of that page. It is important to flush here to handle + D-cache aliasing, to make sure these kernel stores are + visible to user space mappings of that page. + + The corollary case is just as important, if there are users + which have shared+writable mappings of this file, we must make + sure that kernel reads of these pages will see the most recent + stores done by the user. + + If D-cache aliasing is not an issue, this routine may + simply be defined as a nop on that architecture. + + There is a bit set aside in page->flags (PG_arch_1) as + "architecture private". The kernel guarantees that, + for pagecache pages, it will clear this bit when such + a page first enters the pagecache. + + This allows these interfaces to be implemented much more + efficiently. It allows one to "defer" (perhaps indefinitely) + the actual flush if there are currently no user processes + mapping this page. See sparc64's flush_dcache_page and + update_mmu_cache implementations for an example of how to go + about doing this. + + The idea is, first at flush_dcache_page() time, if + page->mapping->i_mmap is an empty tree and ->i_mmap_nonlinear + an empty list, just mark the architecture private page flag bit. + Later, in update_mmu_cache(), a check is made of this flag bit, + and if set the flush is done and the flag bit is cleared. + + IMPORTANT NOTE: It is often important, if you defer the flush, + that the actual flush occurs on the same CPU + as did the cpu stores into the page to make it + dirty. Again, see sparc64 for examples of how + to deal with this. 
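The deferred flush described above for flush_dcache_page() and update_mmu_cache() can be modeled in plain user-space C. This is only an illustrative sketch under stated assumptions, not kernel code: struct model_page, MODEL_PG_ARCH_1, the user_mappings and hw_flushes fields, and all model_* function names are hypothetical stand-ins, and the "flush" is just a counter. The real implementations live in architecture code such as sparc64's.

```c
/* Hypothetical stand-in for the PG_arch_1 "architecture private" bit. */
#define MODEL_PG_ARCH_1 (1UL << 0)

/* Simplified user-space model of a page cache page (not struct page). */
struct model_page {
    unsigned long flags;   /* may hold MODEL_PG_ARCH_1 */
    int user_mappings;     /* number of user space mappings of the page */
    int hw_flushes;        /* counts simulated D-cache flushes */
};

/* Stands in for the real, expensive D-cache flush. */
static void model_hw_flush(struct model_page *page)
{
    page->hw_flushes++;
}

/* Model of flush_dcache_page(): defer the flush when nobody maps the page. */
static void model_flush_dcache_page(struct model_page *page)
{
    if (page->user_mappings == 0) {
        page->flags |= MODEL_PG_ARCH_1;   /* remember the pending flush */
        return;
    }
    model_hw_flush(page);
}

/* Model of update_mmu_cache(): carry out a deferred flush, if one is pending. */
static void model_update_mmu_cache(struct model_page *page)
{
    if (page->flags & MODEL_PG_ARCH_1) {
        model_hw_flush(page);
        page->flags &= ~MODEL_PG_ARCH_1;
    }
}
```

In this model a page dirtied by the kernel while no user maps it triggers no flush at that time; the flush is postponed until a fault installs the first mapping, and is never performed if no mapping is ever created. The model ignores the same-CPU requirement from the note above.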
+ + void copy_to_user_page(struct vm_area_struct *vma, struct page *page, + unsigned long user_vaddr, + void *dst, void *src, int len) + void copy_from_user_page(struct vm_area_struct *vma, struct page *page, + unsigned long user_vaddr, + void *dst, void *src, int len) + When the kernel needs to copy arbitrary data in and out + of arbitrary user pages (f.e. for ptrace()) it will use + these two routines. + + Any necessary cache flushing or other coherency operations + that need to occur should happen here. If the processor's + instruction cache does not snoop cpu stores, it is very + likely that you will need to flush the instruction cache + for copy_to_user_page(). + + void flush_icache_range(unsigned long start, unsigned long end) + When the kernel stores into addresses that it will execute + out of (eg when loading modules), this function is called. + + If the icache does not snoop stores then this routine will need + to flush it. + + void flush_icache_page(struct vm_area_struct *vma, struct page *page) + All the functionality of flush_icache_page can be implemented in + flush_dcache_page and update_mmu_cache. In 2.7 the hope is to + remove this interface completely. diff --git a/Documentation/cciss.txt b/Documentation/cciss.txt new file mode 100644 index 000000000000..d599beb9df8a --- /dev/null +++ b/Documentation/cciss.txt @@ -0,0 +1,132 @@ +This driver is for Compaq's SMART Array Controllers. + +Supported Cards: +---------------- + +This driver is known to work with the following cards: + + * SA 5300 + * SA 5i + * SA 532 + * SA 5312 + * SA 641 + * SA 642 + * SA 6400 + * SA 6400 U320 Expansion Module + * SA 6i + * SA P600 + * SA P800 + * SA E400 + +If nodes are not already created in the /dev/cciss directory, run as root: + +# cd /dev +# ./MAKEDEV cciss + +Device Naming: +-------------- + +You need some entries in /dev for the cciss device. The MAKEDEV script +can make device nodes for you automatically. 
Currently the device setup +is as follows: + +Major numbers: + 104 cciss0 + 105 cciss1 + 106 cciss2 + 107 cciss3 + 108 cciss4 + 109 cciss5 + 110 cciss6 + 111 cciss7 + +Minor numbers: + b7 b6 b5 b4 b3 b2 b1 b0 + |----+----| |----+----| + | | + | +-------- Partition ID (0=wholedev, 1-15 partition) + | + +-------------------- Logical Volume number + +The device naming scheme is: +/dev/cciss/c0d0 Controller 0, disk 0, whole device +/dev/cciss/c0d0p1 Controller 0, disk 0, partition 1 +/dev/cciss/c0d0p2 Controller 0, disk 0, partition 2 +/dev/cciss/c0d0p3 Controller 0, disk 0, partition 3 + +/dev/cciss/c1d1 Controller 1, disk 1, whole device +/dev/cciss/c1d1p1 Controller 1, disk 1, partition 1 +/dev/cciss/c1d1p2 Controller 1, disk 1, partition 2 +/dev/cciss/c1d1p3 Controller 1, disk 1, partition 3 + +SCSI tape drive and medium changer support +------------------------------------------ + +SCSI sequential access devices and medium changer devices are supported and +appropriate device nodes are automatically created. (e.g. +/dev/st0, /dev/st1, etc. See the "st" man page for more details.) +You must enable "SCSI tape drive support for Smart Array 5xxx" and +"SCSI support" in your kernel configuration to be able to use SCSI +tape drives with your Smart Array 5xxx controller. + +Additionally, note that the driver will not engage the SCSI core at init +time. The driver must be directed to dynamically engage the SCSI core via +the /proc filesystem entry which the "block" side of the driver creates as +/proc/driver/cciss/cciss* at runtime. This is because at driver init time, +the SCSI core may not yet be initialized (because the driver is a block +driver) and attempting to register it with the SCSI core in such a case +would cause a hang. This is best done via an initialization script +(typically in /etc/init.d, but could vary depending on distribution).
+For example: + + for x in /proc/driver/cciss/cciss[0-9]* + do + echo "engage scsi" > $x + done + +Once the SCSI core is engaged by the driver, it cannot be disengaged +(except by unloading the driver, if it happens to be linked as a module.) + +Note also that if no sequential access devices or medium changers are +detected, the SCSI core will not be engaged by the action of the above +script. + +Hot plug support for SCSI tape drives +------------------------------------- + +Hot plugging of SCSI tape drives is supported, with some caveats. +The cciss driver must be informed that changes to the SCSI bus +have been made, in addition to and prior to informing the SCSI +mid layer. This may be done via the /proc filesystem. For example: + + echo "rescan" > /proc/scsi/cciss0/1 + +This causes the driver to query the adapter about changes to the +physical SCSI buses and/or fibre channel arbitrated loop and the +driver to make note of any new or removed sequential access devices +or medium changers. The driver will output messages indicating what +devices have been added or removed and the controller, bus, target and +lun used to address the device. Once this is done, the SCSI mid layer +can be informed of changes to the virtual SCSI bus which the driver +presents to it in the usual way. For example: + + echo scsi add-single-device 3 2 1 0 > /proc/scsi/scsi + +to add a device on controller 3, bus 2, target 1, lun 0. Note that +the driver makes an effort to preserve the devices' positions +in the virtual SCSI bus, so if you are only moving tape drives +around on the same adapter and not adding or removing tape drives +from the adapter, informing the SCSI mid layer may not be necessary. + +Note that the naming convention of the /proc filesystem entries +contains a number in addition to the driver name. (E.g. "cciss0" +instead of just "cciss" which you might expect.)
+ +Note: ONLY sequential access devices and medium changers are presented +as SCSI devices to the SCSI mid layer by the cciss driver. Specifically, +physical SCSI disk drives are NOT presented to the SCSI mid layer. The +physical SCSI disk drives are controlled directly by the array controller +hardware and it is important to prevent the kernel from attempting to directly +access these devices too, as if the array controller were merely a SCSI +controller in the same way that we are allowing it to access SCSI tape drives. + diff --git a/Documentation/cdrom/00-INDEX b/Documentation/cdrom/00-INDEX new file mode 100644 index 000000000000..916dafe29d3f --- /dev/null +++ b/Documentation/cdrom/00-INDEX @@ -0,0 +1,33 @@ +00-INDEX + - this file (info on CD-ROMs and Linux) +Makefile + - only used to generate TeX output from the documentation. +aztcd + - info on Aztech/Orchid/Okano/Wearnes/Conrad/CyCDROM driver. +cdrom-standard.tex + - LaTeX document on standardizing the CD-ROM programming interface. +cdu31a + - info on the Sony CDU31A/CDU33A CD-ROM driver. +cm206 + - info on the Philips/LMS cm206/cm260 CD-ROM driver. +gscd + - info on the Goldstar R420 CD-ROM driver. +ide-cd + - info on setting up and using ATAPI (aka IDE) CD-ROMs. +isp16 + - info on the CD-ROM interface on ISP16, MAD16 or Mozart sound card. +mcd + - info on limitations of standard Mitsumi CD-ROM driver. +mcdx + - info on improved Mitsumi CD-ROM driver. +optcd + - info on the Optics Storage 8000 AT CD-ROM driver +packet-writing.txt + - Info on the CDRW packet writing module +sbpcd + - info on the SoundBlaster/Panasonic CD-ROM interface driver. +sjcd + - info on the SANYO CDR-H94A CD-ROM interface driver. +sonycd535 + - info on the Sony CDU-535 (and 531) CD-ROM driver. 
+ diff --git a/Documentation/cdrom/Makefile b/Documentation/cdrom/Makefile new file mode 100644 index 000000000000..a19e321928e1 --- /dev/null +++ b/Documentation/cdrom/Makefile @@ -0,0 +1,21 @@ +LATEXFILE = cdrom-standard + +all: + make clean + latex $(LATEXFILE) + latex $(LATEXFILE) + @if [ -x `which gv` ]; then \ + `dvips -q -t letter -o $(LATEXFILE).ps $(LATEXFILE).dvi` ;\ + `gv -antialias -media letter -nocenter $(LATEXFILE).ps` ;\ + else \ + `xdvi $(LATEXFILE).dvi &` ;\ + fi + make sortofclean + +clean: + rm -f $(LATEXFILE).ps $(LATEXFILE).dvi $(LATEXFILE).aux $(LATEXFILE).log + +sortofclean: + rm -f $(LATEXFILE).aux $(LATEXFILE).log + + diff --git a/Documentation/cdrom/aztcd b/Documentation/cdrom/aztcd new file mode 100644 index 000000000000..6bf0290ef7ce --- /dev/null +++ b/Documentation/cdrom/aztcd @@ -0,0 +1,822 @@ +$Id: README.aztcd,v 2.60 1997/11/29 09:51:25 root Exp root $ + Readme-File Documentation/cdrom/aztcd + for + AZTECH CD-ROM CDA268-01A, ORCHID CD-3110, + OKANO/WEARNES CDD110, CONRAD TXC, CyCDROM CR520, CR540 + CD-ROM Drives + Version 2.6 and newer + (for other drives see 6.-8.) + +NOTE: THIS DRIVER WILL WORK WITH THE CD-ROM DRIVES LISTED, WHICH HAVE + A PROPRIETARY INTERFACE (implemented on a sound card or on an + ISA-AT-bus card). + IT WILL DEFINITELY NOT WORK WITH CD-ROM DRIVES WITH *IDE*-INTERFACE, + such as the Aztech CDA269-031SE !!! (The only known exceptions are + 'faked' IDE drives like the CyCDROM CR520ie which work with aztcd + under certain conditions, see 7.). IF YOU'RE USING A CD-ROM DRIVE + WITH IDE-INTERFACE, SOMETIMES ALSO CALLED ATAPI-COMPATIBLE, PLEASE + USE THE ide-cd.c DRIVER, WRITTEN BY MARK LORD AND SCOTT SNYDER ! + THE STANDARD-KERNEL 1.2.x NOW ALSO SUPPORTS IDE-CDROM-DRIVES, SEE THE + HARDDISK (!) SECTION OF make config, WHEN COMPILING A NEW KERNEL!!! +---------------------------------------------------------------------------- + +Contents of this file: + 1. NOTE + 2. INSTALLATION + 3. CONFIGURING YOUR KERNEL + 4. 
RECOMPILING YOUR KERNEL + 4.1 AZTCD AS A RUN-TIME LOADABLE MODULE + 4.2 CDROM CONNECTED TO A SOUNDCARD + 5. KNOWN PROBLEMS, FUTURE DEVELOPMENTS + 5.1 MULTISESSION SUPPORT + 5.2 STATUS RECOGNITION + 5.3 DOSEMU's CDROM SUPPORT + 6. BUG REPORTS + 7. OTHER DRIVES + 8. IF YOU DON'T SUCCEED ... DEBUGGING + 9. TECHNICAL HISTORY OF THE DRIVER + 10. ACKNOWLEDGMENTS + 11. PROGRAMMING ADD ONS: CDPLAY.C + APPENDIX: Source code of cdplay.c +---------------------------------------------------------------------------- + +1. NOTE +This software has successfully passed alpha and beta testing and has been part of +the standard kernel since kernel 1.1.8x (December 1994). It works with +AZTECH CDA268-01A, ORCHID CDS-3110, ORCHID/WEARNES CDD110 and CONRAD TXC +(Nr.99 31 23 -series 04) and has proven to be stable with kernel +versions 1.0.9 and newer. But as with any software, there still may be bugs in it. +So if you encounter problems, you are invited to help us improve this software. +Please send me a detailed bug report (see chapter BUG REPORTS). You are also +invited to help us increase the number of supported drives. + +Please read the README-files carefully and always keep a backup copy of your +old kernel, in order to reboot if something goes wrong! + +2. INSTALLATION +The driver consists of a header file 'aztcd.h', which normally should reside +in /usr/src/linux/drivers/cdrom and the source code 'aztcd.c', which normally +resides in the same place. It uses /dev/aztcd (/dev/aztcd0 in some distri- +butions), which must be a valid block device with major number 29 and reside +in directory /dev. To mount a CD-ROM, your kernel needs to have the ISO9660- +filesystem support included. + +PLEASE NOTE: aztcd.c has been developed in parallel to the linux kernel, +which had and is having many major and minor changes which are not backward +compatible. Quite definitely aztcd.c version 1.80 and newer will NOT work +in kernels older than 1.3.33.
So please always use the most recent version +of aztcd.c with the appropriate linux-kernel. + +3. CONFIGURING YOUR KERNEL +If your kernel is already configured for using the AZTECH driver you will +see the following message while Linux boots: + Aztech CD-ROM Init: DriverVersion= BaseAddress= + Aztech CD-ROM Init: FirmwareVersion=>> + Aztech CD-ROM Init: detected + Aztech CD-ROM Init: End +If the message looks different and you are sure to have a supported drive, +it may have a different base address. The Aztech driver does look for the +CD-ROM drive at the base address specified in aztcd.h at compile time. This +address can be overwritten by boot parameter aztcd=....You should reboot and +start Linux with boot parameter aztcd=, e.g. aztcd=0x320. If +you do not know the base address, start your PC with DOS and look at the boot +message of your CD-ROM's DOS driver. If that still does not help, use boot +parameter aztcd=,0x79 , this tells aztcd to try a little harder. +aztcd may be configured to use autoprobing the base address by recompiling +it (see chapter 4.). + +If the message looks correct, as user 'root' you should be able to mount the +drive by + mount -t iso9660 -r /dev/aztcd0 /mnt +and use it as any other filesystem. (If this does not work, check if +/dev/aztcd0 and /mnt do exist and create them, if necessary by doing + mknod /dev/aztcd0 b 29 0 + mkdir /mnt + +If you still get a different message while Linux boots or when you get the +message, that the ISO9660-filesystem is not supported by your kernel, when +you try to mount the CD-ROM drive, you have to recompile your kernel. + +If you do *not* have an Aztech/Orchid/Okano/Wearnes/TXC drive and want to +bypass drive detection during Linux boot up, start with boot parameter aztcd=0. + +Most distributions nowadays do contain a boot disk image containing aztcd. +Please note, that this driver will not work with IDE/ATAPI drives! With these +you must use ide-cd.c instead. + +4. 
RECOMPILING YOUR KERNEL +If your kernel is not yet configured for the AZTECH driver and the ISO9660- +filesystem, you have to recompile your kernel: + +- Edit aztcd.h to set the I/O-address to your I/O-Base address (AZT_BASE_ADDR), + the driver does not use interrupts or DMA, so if you are using an AZTECH + CD268, an ORCHID CD-3110 or ORCHID/WEARNES CDD110 that's the only item you + have to set up. If you have a soundcard, read chapter 4.2. + Users of other drives should read chapter OTHER DRIVES of this file. + You also can configure that address by kernel boot parameter aztcd=... +- aztcd may be configured to use autoprobing the base address by setting + AZT_BASE_ADDR to '-1'. In that case aztcd probes the addresses listed + under AZT_BASE_AUTO. But please remember, that autoprobing always may + incorrectly influence other hardware components too! +- There are some other points, which may be configured, e.g. auto-eject the + CD when unmounting a drive, tray locking etc., see aztcd.h for details. +- If you're using a linux kernel version prior to 2.1.0, in aztcd.h + uncomment the line '#define AZT_KERNEL_PRIOR_2_1' +- Build a new kernel, configure it for 'Aztech/Orchid/Okano/Wearnes support' + (if you want aztcd to be part of the kernel). Do not configure it for + 'Aztech... support', if you want to use aztcd as a run time loadable module. + But in any case you must have the ISO9660-filesystem included in your + kernel. +- Activate the new kernel, normally this is done by running LILO (don't for- + get to configure it before and to keep a copy of your old kernel in case + something goes wrong!). +- Reboot +- If you've included aztcd in your kernel, you now should see during boot + some messages like + Aztech CD-ROM Init: DriverVersion= BaseAddress= + Aztech CD-ROM Init: FirmwareVersion= + Aztech CD-ROM Init: detected + Aztech CD-ROM Init: End +- If you have not included aztcd in your kernel, but want to load aztcd as a + run time loadable module see 4.1. 
+- If the message looks correct, as user 'root' you should be able to mount + the drive by + mount -t iso9660 -r /dev/aztcd0 /mnt + and use it as any other filesystem. (If this does not work, check if + /dev/aztcd0 and /mnt do exist and create them, if necessary by doing + mknod /dev/aztcd0 b 29 0 + mkdir /mnt +- If this still does not help, see chapters OTHER DRIVES and DEBUGGING. + +4.1 AZTCD AS A RUN-TIME LOADABLE MODULE +If you do not need aztcd permanently, you can also load and remove the driver +during runtime via insmod and rmmod. To build aztcd as a loadable module you +must configure your kernel for AZTECH module support (answer 'm' when con- +figuring the kernel). However, you may run into problems if the version of +your boot kernel is not the same as the source kernel version from which +you create the modules. So rebuild your kernel, if necessary. + +Now edit the base address of your AZTECH interface card in +/usr/src/linux/drivers/cdrom/aztcd.h to the appropriate value. +aztcd may be configured to use autoprobing the base address by setting +AZT_BASE_ADDR to '-1'. In that case aztcd probes the addresses listed +under AZT_BASE_AUTO. But please remember, that autoprobing always may +incorrectly influence other hardware components too! +There are also some special features which may be configured, e.g. +auto-eject a CD when unmounting the drive etc; see aztcd.h for details. +Then change to /usr/src/linux and do a + make modules + make modules_install +After that you can run-time load the driver via + insmod /lib/modules/X.X.X/misc/aztcd.o +and remove it via rmmod aztcd. +If you did not set the correct base address in aztcd.h, you can also supply the +base address when loading the driver via + insmod /lib/modules/X.X.X/misc/aztcd.o aztcd= +Again specifying aztcd=-1 will cause autoprobing.
+If you do not have the iso9660-filesystem in your boot kernel, you also have +to load it before you can mount the CDROM: + insmod /lib/modules/X.X.X/fs/isofs.o +The mount procedure works as described in 4. above. +(In all commands 'X.X.X' is the current linux kernel version number) + +4.2 CDROM CONNECTED TO A SOUNDCARD +Most soundcards do have a bus interface to the CDROM-drive. In many cases +this soundcard needs to be configured, before the CDROM can be used. This +configuration procedure consists of writing some kind of initialization +data to the soundcard registers. The AZTECH-CDROM driver at the moment +supports only one type of soundcard (SoundWave32). Users of other soundcards +should try to boot DOS first and let their DOS drivers initialize the +soundcard and CDROM, then warm boot (or use loadlin) their PC to start +Linux. +Support for the CDROM-interface of SoundWave32-soundcards is directly +implemented in the AZTECH driver. Please edit linux/drivers/cdrom/aztcd.h, +uncomment line '#define AZT_SW32' and set the appropriate value for +AZT_BASE_ADDR and AZT_SW32_BASE_ADDR. This support was tested with an Orchid +CDS-3110 connected to a SoundWave32. +If you want your soundcard to be supported, find out, how it needs to be +configured and mail me (see 6.) the appropriate information. + +5. KNOWN PROBLEMS, FUTURE DEVELOPMENTS +5.1 MULTISESSION SUPPORT +Multisession support for CD's still is a myth. I implemented and tested a basic +support for multisession and XA CDs, but I still have not enough CDs and appli- +cations to test it rigorously. So if you'd like to help me, please contact me +(Email address see below). As of version 1.4 and newer you can enable the +multisession support in aztcd.h by setting AZT_MULTISESSION to 1. Doing so +will cause the ISO9660-filesystem to deal with multisession CDs, i.e. redirect +requests to the Table of Contents (TOC) information from the last session, +which contains the info of all previous sessions etc.
If you do set +AZT_MULTISESSION to 0, you can use multisession CDs anyway. In that case the +drive's firmware will do automatic redirection. For the ISO9660-filesystem any +multisession CD will then look like a 'normal' single session CD. But never- +theless the data of all sessions are viewable and accessible. So with practical- +ly all real world applications you won't notice the difference. But as future +applications may make use of advanced multisession features, I've started to +implement the interface for the ISO9660 multisession interface via ioctl +CDROMMULTISESSION. + +5.2 STATUS RECOGNITION +The drive status recognition does not work correctly in all cases. Changing +a disk or having the door open, when a drive is already mounted, is detected +by the Aztech driver itself, but nevertheless causes multiple read attempts +by the different layers of the ISO9660-filesystem driver, which finally timeout, +so you have to wait quite a little... But isn't it bad style to change a disk +in a mounted drive, anyhow ?! + +The driver uses busy wait in most cases for the drive handshake (macros +STEN_LOW and DTEN_LOW). I tested with a 486/DX2 at 66MHz and a Pentium at +60MHz and 90MHz. Whenever you use a much faster machine you are likely to get +timeout messages. In that case edit aztcd.h and increase the timeout value +AZT_TIMEOUT. + +For some 'slow' drive commands I implemented waiting with a timer waitqueue +(macro STEN_LOW_WAIT). If you get this timeout message, you may also edit +aztcd.h and increase the timeout value AZT_STATUS_DELAY. The waitqueue has +shown to be a little critical. If you get kernel panic messages, edit aztcd.c +and substitute STEN_LOW_WAIT by STEN_LOW. Busy waiting with STEN_LOW is more +stable, but also causes CPU overhead. 
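
The busy-wait handshake with a timeout, as discussed above, can be sketched in plain C. This is a minimal model and not the code of aztcd.c: SKETCH_STEN_BIT, the callback type and all sketch_* names are hypothetical, and the assumption that the timeout is a simple poll count (which would explain why a much faster machine times out sooner) is mine, not something the driver source is quoted as confirming.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status bit; the real definitions live in aztcd.h. */
#define SKETCH_STEN_BIT 0x01

/* Callback that reads the drive status once (hypothetical). */
typedef unsigned char (*sketch_read_status_fn)(void *ctx);

/*
 * Poll the status until the bit reads low or max_polls attempts
 * have been made. Returns 0 on success and -1 on timeout. The
 * CPU spins the whole time, which is the overhead noted above.
 */
static int sketch_wait_bit_low(sketch_read_status_fn read_status,
			       void *ctx, unsigned long max_polls)
{
	unsigned long i;

	assert(read_status != NULL);
	for (i = 0; i < max_polls; i++) {
		if ((read_status(ctx) & SKETCH_STEN_BIT) == 0)
			return 0;
	}
	return -1;
}

/* Fake drive for experiments: reports the bit as high until the
 * counter behind ctx has been decremented to zero. */
static unsigned char sketch_fake_status(void *ctx)
{
	unsigned long *remaining = ctx;

	if (*remaining == 0)
		return 0;
	(*remaining)--;
	return SKETCH_STEN_BIT;
}
```

Under this model, raising max_polls plays the role of increasing the timeout value: a drive that needs more polls than the limit allows is reported as a timeout even though it would have answered eventually.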
+ +5.3 DOSEMU's CD-ROM SUPPORT +With release 1.20 aztcd was modified to allow access to CD-ROMS when running +under dosemu-0.60.0. aztcd versions before 1.20 are most likely to crash +Linux, when a CD-ROM is accessed under dosemu. This problem has partly been +fixed, but still when accessing a directory for the first time the system +might hang for some 30sec. So be patient, when using dosemu's CD-ROM support +in combination with aztcd :-) ! +This problem has now (July 1995) been fixed by a modification to dosemu's +CD-ROM driver. The new version came with dosemu-0.60.2, see dosemu's +README.CDROM. + +6. BUG REPORTS +Please send detailed bug reports and bug fixes via EMail to + + Werner.Zimmermann@fht-esslingen.de + +Please include a description of your CD-ROM drive type and interface card, +the exact firmware message during Linux bootup, the version number of the +AZTECH-CDROM-driver and the Linux kernel version. Also a description of your +system's other hardware could be of interest, especially microprocessor type, +clock frequency, other interface cards such as soundcards, ethernet adapter, +game cards etc. + +I will try to collect the reports and make the necessary modifications from +time to time. I may also come back to you directly with some bug fixes and +ask you to do further testing and debugging. + +Editors of CD-ROMs are invited to send a 'cooperation' copy of their +CD-ROMs to the volunteers, who provided the CD-ROM support for Linux. My +snail mail address for such 'stuff' is + Prof. Dr. W. Zimmermann + Fachhochschule fuer Technik Esslingen + Fachbereich IT + Flandernstrasse 101 + D-73732 Esslingen + Germany + + +7. OTHER DRIVES +The drives ORCHID CDS3110, OKANO CDD110, WEARNES CDD110 and Conrad +TXC Nr. 993123-series 04 look nearly the same as the AZTECH CDA268-01A; in particular, +they seem to use the same command codes. So it was quite simple to make the +AZTECH driver work with these drives.
+ +Unfortunately I do not have any of these drives available, so I couldn't test +it myself. In some installations, it seems necessary to initialize the drive +with the DOS driver before (especially if combined with a sound card) and then +do a warm boot (CTRL-ALT-RESET) or start Linux from DOS, e.g. with 'loadlin'. + +If you do not succeed, read chapter DEBUGGING. Thanks in advance! + +Sorry for the inconvenience, but it is difficult to develop for hardware, +which you don't have available for testing. So if you like, please help us. + +If you do have a CyCDROM CR520ie thanks to Hilmar Berger's help your chances +are good, that it will work with aztcd. The CR520ie is sold as an IDE-drive +and really is connected to the IDE interface (primary at 0x1F0 or secondary +at 0x170, configured as slave, not as master). Nevertheless it is not ATAPI +compatible but still uses Aztech's command codes. + + +8. DEBUGGING : IF YOU DON'T SUCCEED, TRY THE FOLLOWING +-reread the complete README file +-make sure, that your drive is hardware configured for + transfer mode: polled + IRQ: not used + DMA: not used + Base Address: something like 300, 320 ... + You can check this, when you start the DOS driver, which came with your + drive. By appropriately configuring the drive and the DOS driver you can + check, whether your drive does operate in this mode correctly under DOS. If + it does not operate under DOS, it won't under Linux. + If your drive's base address is something like 0x170 or 0x1F0 (and it is + not a CyCDROM CR520ie or CR 940ie) you most likely are having an IDE/ATAPI- + compatible drive, which is not supported by aztcd.c, use ide-cd.c instead. + Make sure the Base Address is configured correctly in aztcd.h, also make + sure, that /dev/aztcd0 exists with the correct major number (compare it with + the entry in file /usr/include/linux/major.h for the Aztech drive). +-insert a CD-ROM and close the tray +-cold boot your PC (i.e. 
via the power on switch or the reset button) +-if you start Linux via DOS, e.g. using loadlin, make sure, that the DOS + driver for the CD-ROM drive is not loaded (comment out the calling lines + in DOS' config.sys!) +-look for the aztcd: init message during Linux init and note them exactly +-log in as root and do a mount -t iso9660 /dev/aztcd0 /mnt +-if you don't succeed in the first time, try several times. Try also to open + and close the tray, then mount again. Please note carefully all commands + you typed in and the aztcd-messages, which you get. +-if you get an 'Aztech CD-ROM init: aborted' message, read the remarks about + the version string below. + +If this does not help, do the same with the following differences +-start DOS before; make now sure, that the DOS driver for the CD-ROM is + loaded under DOS (i.e. uncomment it again in config.sys) +-warm boot your PC (i.e. via CTRL-ALT-DEL) + if you have it, you can also start via loadlin (try both). + ... + Again note all commands and the aztcd-messages. + +If you see STEN_LOW or STEN_LOW_WAIT error messages, increase the timeout +values. + +If this still does not help, +-look in aztcd.c for the lines #if 0 + #define AZT_TEST1 + ... + #endif + and substitute '#if 0' by '#if 1'. +-recompile your kernel and repeat the above two procedures. You will now get + a bundle of debugging messages from the driver. Again note your commands + and the appropriate messages. If you have syslogd running, these messages + may also be found in syslogd's kernel log file. Nevertheless in some + installations syslogd does not yet run, when init() is called, thus look for + the aztcd-messages during init, before the login-prompt appears. + Then look in aztcd.c, to find out, what happened. 
The normal calling sequence + is: aztcd_init() during Linux bootup procedure init() + after doing a 'mount -t iso9660 /dev/aztcd0 /mnt' the normal calling sequence is + aztcd_open() -> Status 2c after cold reboot with CDROM or audio CD inserted + -> Status 8 after warm reboot with CDROM inserted + -> Status 2e after cold reboot with no disk, closed tray + -> Status 6e after cold reboot, mount with door open + aztUpdateToc() + aztGetDiskInfo() + aztGetQChannelInfo() repeated several times + aztGetToc() + aztGetQChannelInfo() repeated several times + a list of track information + do_aztcd_request() } + azt_transfer() } repeated several times + azt_poll } + Check, if there is a difference in the calling sequence or the status flags! + + There are a lot of other messages, eg. the ACMD-command code (defined in + aztcd.h), status info from the getAztStatus-command and the state sequence of + the finite state machine in azt_poll(). The most important are the status + messages, look how they are defined and try to understand, if they make + sense in the context where they appear. With a CD-ROM inserted the status + should always be 8, except in aztcd_open(). Try to open the tray, insert an + audio disk, insert no disk or reinsert the CD-ROM and check, if the status + bits change accordingly. The status bits are the most likely point, where + the drive manufacturers may implement changes. + +If you still don't succeed, a good point to start is to look in aztcd.c in +function aztcd_init, where the drive should be detected during init. Do the +following: +-reboot the system with boot parameter 'aztcd=,0x79'. With + parameter 0x79 most of the drive version detection is bypassed. After that + you should see the complete version string including leading and trailing + blanks during init. + Now adapt the statement + if ((result[1]=='A')&&(result[2]=='Z' ...) + in aztcd_init() to exactly match the first 3 or 4 letters you have seen. 
+-Another point is the 'smart' card detection feature in aztcd_init(). Normally + the CD-ROM drive is ready, when aztcd_init is trying to read the version + string and a time consuming ACMD_SOFT_RESET command can be avoided. This is + detected by looking, if AFL_OP_OK can be read correctly. If the CD-ROM drive + hangs in some unknown state, e.g. because of an error before a warm start or + because you first operated under DOS, even the version string may be correct, + but the following commands will not. Then change the code in such a way, + that the ACMD_SOFT_RESET is issued in any case, by substituting the + if-statement 'if ( ...=AFL_OP_OK)' by 'if (1)'. + +If you succeed, please mail me the exact version string of your drive and +the code modifications, you have made together with a short explanation. +If you don't succeed, you may mail me the output of the debugging messages. +But remember, they are only useful, if they are exact and complete and you +describe in detail your hardware setup and what you did (cold/warm reboot, +with/without DOS, DOS-driver started/not started, which Linux-commands etc.) + + +9. TECHNICAL HISTORY OF THE DRIVER +The AZTECH-Driver is a rework of the Mitsumi-Driver. Four major items had to +be reworked: + +a) The Mitsumi drive does issue complete status information acknowledging +each command, the Aztech drive does only signal that the command was +processed. So whenever the complete status information is needed, an extra +ACMD_GET_STATUS command is issued. The handshake procedure for the drive +can be found in the functions aztSendCmd(), sendAztCmd() and getAztStatus(). + +b) The Aztech Drive does not have a ACMD_GET_DISK_INFO command, so the +necessary info about the number of tracks (firstTrack, lastTrack), disk +length etc. has to be read from the TOC in the lead in track (see function +aztGetDiskInfo()). 
+ +c) Whenever data is read from the drive, the Mitsumi drive is started with a +command to read an indefinite (0xffffff) number of sectors. When the appropriate +number of sectors is read, the drive is stopped by a ACMD_STOP command. This +does not work with the Aztech drive. I did not find a way to stop it. The +stop and pause commands do only work in AUDIO mode but not in DATA mode. +Therefore I had to modify the 'finite state machine' in function azt_poll to +only read a certain number of sectors and then start a new read on demand. As I +have not completely understood, how the buffer/caching scheme of the Mitsumi +driver was implemented, I am not sure, if I have covered all cases correctly. +Whenever you get timeout messages, the bug is most likely to be in that +function azt_poll() around switch(cmd) .... case AZT_S_DATA. + +d) I did not get information about changing drive mode. So I doubt, that the +code around function azt_poll() case AZT_S_MODE does work. In my test I have +not been able to switch to reading in raw mode. For reading raw mode, Aztech +uses a different command than for cooked mode, which I only have implemen- +ted in the ioctl-section but not in the section which is used by the ISO9660. + +The driver was developed on an AST PC with Intel 486/DX2, 8MB RAM, 340MB IDE +hard disk and on an AST PC with Intel Pentium 60MHz, 16MB RAM, 520MB IDE +running Linux kernel version 1.0.9 from the LST 1.8 Distribution. The kernel +was compiled with gcc 2.5.8. My CD-ROM drive is an Aztech CDA268-01A. My +drive says, that it has Firmware Version AZT26801A1.3. It came with an ISA-bus +interface card and works with polled I/O without DMA and without interrupts. +The code for all other drives was 'remote' tested and debugged by a number of +volunteers on the Internet.
+ +Points, where I feel that possible problems might be and all points where I +did not completely understand the drive's behaviour or trust my own code are +marked with /*???*/ in the source code. There are also some parts in the +Mitsumi driver, where I did not completely understand their code. + + +10. ACKNOWLEDGMENTS +Without the help of P.Bush, Aztech, who delivered technical information +about the Aztech Drive and without the help of E.Moenkeberg, GWDG, who did a +great job in analyzing the command structure of various CD-ROM drives, this +work would not have been possible. E.Moenkeberg was also a great help in +making the software 'kernel ready' and in answering many of the CDROM-related +questions in the newsgroups. He really is *the* Linux CD-ROM guru. Thanks +also to all the guys on the Internet, who collected valuable technical +information about CDROMs. + +Joe Nardone (joe@access.digex.net) was a patient tester even for my first +trial, which was more than slow, and made suggestions for code improvement. +Especially the 'finite state machine' azt_poll() was rewritten by Joe to get +clean C code and avoid the ugly 'gotos', which I copied from mcd.c. + +Robby Schirmer (schirmer@fmi.uni-passau.de) tested the audio stuff (ioctls) +and suggested a lot of patches for them. + +Joseph Piskor and Peter Nugent were the first users with the ORCHID CD3110 +and also were very patient with the problems which occurred. + +Reinhard Max delivered the information for the CDROM-interface of the +SoundWave32 soundcards. + +Jochen Kunz and Olaf Kaluza delivered the information for supporting Conrad's +TXC drive. + +Hilmar Berger delivered the patches for supporting CyCDROM CR520ie. + +Anybody, who is interested in these items should have a look at 'ftp.gwdg.de', +directory 'pub/linux/cdrom' and at 'ftp.cdrom.com', directory 'pub/cdrom'. + +11. PROGRAMMING ADD ONs: cdplay.c +You can use the ioctl-functions included in aztcd.c in your own programs. 
As +an example on how to do this, you will find a tiny CD Player for audio CDs +named 'cdplay.c'. It allows you to play audio CDs. You can play a specified +track, pause and resume or skip tracks forward and backwards. If you quit the +program without stopping the drive, playing is continued. You can also +(mis)use cdplay to read and hexdump data disks. You can find the code in the +APPENDIX of this file, which you should cut out with an editor and store in a +separate file 'cdplay.c'. To compile it and make it executable, do + gcc -s -Wall -O2 -L/usr/lib cdplay.c -o /usr/local/bin/cdplay # compiles it + chmod +755 /usr/local/bin/cdplay # makes it executable + ln -s /dev/aztcd0 /dev/cdrom # creates a link + (for /usr/lib substitute the top level directory, where your include files + reside, and for /usr/local/bin the directory, where you want the executable + binary to reside ) + +You have to set the correct permissions for cdplay *and* for /dev/mcd0 or +/dev/aztcd0 in order to use it. Remember, that you should not have /dev/cdrom +mounted, when you're playing audio CDs. + +This program is just a hack for testing the ioctl-functions in aztcd.c. I will +not maintain it, so if you run into problems, discard it or have a look into +the source code 'cdplay.c'. The program does only contain a minimum of user +protection and input error detection. If you use the commands in the wrong +order or if you try to read a CD at wrong addresses, you may get error messages +or even hang your machine. If you get STEN_LOW, STEN_LOW_WAIT or segment violation +error messages when using cdplay, after that, the system might not be stable +any more, so you'd better reboot. As the ioctl-functions run in kernel mode, +most normal Linux-multitasking protection features do not work. By using +uninitialized 'wild' pointers etc., it is easy to write to other users' data +and program areas, destroy kernel tables etc.. 
So if you experiment with ioctls +as always when you are doing systems programming and kernel hacking, you +should have a backup copy of your system in a safe place (and you also +should try restoring from a backup copy first)! + +A reworked and improved version called 'cdtester.c', which has yet more +features for testing CDROM-drives can be found in +Documentation/cdrom/sbpcd, written by E.Moenkeberg. + +Werner Zimmermann +Fachhochschule fuer Technik Esslingen +(EMail: Werner.Zimmermann@fht-esslingen.de) +October, 1997 + +--------------------------------------------------------------------------- +APPENDIX: Source code of cdplay.c + +/* Tiny Audio CD Player + + Copyright 1994, 1995, 1996 Werner Zimmermann (Werner.Zimmermann@fht-esslingen.de) + +This program originally was written to test the audio functions of the +AZTECH.CDROM-driver, but it should work with every CD-ROM drive. Before +using it, you should set a symlink from /dev/cdrom to your real CDROM +device. + +The GNU General Public License applies to this program. + +History: V0.1 W.Zimmermann: First release. Nov. 8, 1994 + V0.2 W.Zimmermann: Enhanced functionality. Nov. 9, 1994 + V0.3 W.Zimmermann: Additional functions. Nov. 28, 1994 + V0.4 W.Zimmermann: fixed some bugs. Dec. 17, 1994 + V0.5 W.Zimmermann: clean 'scanf' commands without compiler warnings + Jan. 6, 1995 + V0.6 W.Zimmermann: volume control (still experimental). Jan. 24, 1995 + V0.7 W.Zimmermann: read raw modified. 
July 26, 95 +*/ + +#include <stdio.h> +#include <stdlib.h> +#include <ctype.h> +#include <fcntl.h> +#include <unistd.h> +#include <sys/types.h> +#include <sys/ioctl.h> +#include <linux/cdrom.h> + +void help(void) +{ printf("Available Commands: STOP s EJECT/CLOSE e QUIT q\n"); + printf(" PLAY TRACK t PAUSE p RESUME r\n"); + printf(" NEXT TRACK n REPEAT LAST l HELP h\n"); + printf(" SUB CHANNEL c TRACK INFO i PLAY AT a\n"); + printf(" READ d READ RAW w VOLUME v\n"); +} + +int main(void) +{ int handle; + unsigned char command=' ', ini=0, first=1, last=1; + char input[16]; /* holds one typed command word */ + unsigned int cmd, i,j,k, arg1,arg2,arg3; + struct cdrom_ti ti; + struct cdrom_tochdr tocHdr; + struct cdrom_subchnl subchnl; + struct cdrom_tocentry entry; + struct cdrom_msf msf; + union { struct cdrom_msf msf; + unsigned char buf[CD_FRAMESIZE_RAW]; + } azt; + struct cdrom_volctrl volctrl; + + printf("\nMini-Audio CD-Player V0.72 (C) 1994,1995,1996 W.Zimmermann\n"); + handle=open("/dev/cdrom",O_RDWR); + ioctl(handle,CDROMRESUME); + + if (handle<0) + { printf("Drive Error: already playing, no audio disk, door open\n"); + printf(" or no permission (you must be ROOT in order to use this program)\n"); + } + else + { help(); + while (1) + { printf("Type command (h = help): "); + if (scanf("%15s",input)!=1) break; /* leave the loop on EOF or read error */ + command=input[0]; + switch (command) + { case 'e': cmd=CDROMEJECT; + ioctl(handle,cmd); + break; + case 'p': if (!ini) + { printf("Command not allowed - play track first\n"); + } + else + { cmd=CDROMPAUSE; + if (ioctl(handle,cmd)) printf("Drive Error\n"); + } + break; + case 'r': if (!ini) + { printf("Command not allowed - play track first\n"); + } + else + { cmd=CDROMRESUME; + if (ioctl(handle,cmd)) printf("Drive Error\n"); + } + break; + case 's': cmd=CDROMPAUSE; + if (ioctl(handle,cmd)) printf("Drive error or already stopped\n"); + cmd=CDROMSTOP; + if (ioctl(handle,cmd)) printf("Drive error\n"); + break; + case 't': cmd=CDROMREADTOCHDR; + if (ioctl(handle,cmd,&tocHdr)) printf("Drive Error\n"); + first=tocHdr.cdth_trk0; + last= tocHdr.cdth_trk1; + if ((first==0)||(first>last)) + { printf ("--could not read TOC\n"); + } + else + {
printf("--first track: %d --last track: %d --enter track number: ",first,last); + cmd=CDROMPLAYTRKIND; + scanf("%i",&arg1); + ti.cdti_trk0=arg1; + if (ti.cdti_trk0<first) ti.cdti_trk0=first; + if (ti.cdti_trk0>last) ti.cdti_trk0=last; + ti.cdti_ind0=0; + ti.cdti_trk1=last; + ti.cdti_ind1=0; + if (ioctl(handle,cmd,&ti)) printf("Drive Error\n"); + ini=1; + } + break; + case 'n': if (!ini++) + { if (ioctl(handle,CDROMREADTOCHDR,&tocHdr)) printf("Drive Error\n"); + first=tocHdr.cdth_trk0; + last= tocHdr.cdth_trk1; + ti.cdti_trk0=first-1; + } + if ((first==0)||(first>last)) + { printf ("--could not read TOC\n"); + } + else + { cmd=CDROMPLAYTRKIND; + if (++ti.cdti_trk0 > last) ti.cdti_trk0=last; + ti.cdti_ind0=0; + ti.cdti_trk1=last; + ti.cdti_ind1=0; + if (ioctl(handle,cmd,&ti)) printf("Drive Error\n"); + ini=1; + } + break; + case 'l': if (!ini++) + { if (ioctl(handle,CDROMREADTOCHDR,&tocHdr)) printf("Drive Error\n"); + first=tocHdr.cdth_trk0; + last= tocHdr.cdth_trk1; + ti.cdti_trk0=first+1; + } + if ((first==0)||(first>last)) + { printf ("--could not read TOC\n"); + } + else + { cmd=CDROMPLAYTRKIND; + if (--ti.cdti_trk0 < first) ti.cdti_trk0=first; + ti.cdti_ind0=0; + ti.cdti_trk1=last; + ti.cdti_ind1=0; + if (ioctl(handle,cmd,&ti)) printf("Drive Error\n"); + ini=1; + } + break; + case 'c': subchnl.cdsc_format=CDROM_MSF; + if (ioctl(handle,CDROMSUBCHNL,&subchnl)) + printf("Drive Error\n"); + else + { printf("AudioStatus:%s Track:%d Mode:%d MSF=%d:%d:%d\n", \ + subchnl.cdsc_audiostatus==CDROM_AUDIO_PLAY ?
"PLAYING":"NOT PLAYING",\ + subchnl.cdsc_trk,subchnl.cdsc_adr, \ + subchnl.cdsc_absaddr.msf.minute, subchnl.cdsc_absaddr.msf.second, \ + subchnl.cdsc_absaddr.msf.frame); + } + break; + case 'i': if (!ini) + { printf("Command not allowed - play track first\n"); + } + else + { cmd=CDROMREADTOCENTRY; + printf("Track No.: "); + scanf("%d",&arg1); + entry.cdte_track=arg1; + if (entry.cdte_track<first) entry.cdte_track=first; + if (entry.cdte_track>last) entry.cdte_track=last; + entry.cdte_format=CDROM_MSF; + if (ioctl(handle,cmd,&entry)) + { printf("Drive error or invalid track no.\n"); + } + else + { printf("Mode %d Track, starts at %d:%d:%d\n", \ + entry.cdte_adr,entry.cdte_addr.msf.minute, \ + entry.cdte_addr.msf.second,entry.cdte_addr.msf.frame); + } + } + break; + case 'a': cmd=CDROMPLAYMSF; + printf("Address (min:sec:frame) "); + scanf("%d:%d:%d",&arg1,&arg2,&arg3); + msf.cdmsf_min0 =arg1; + msf.cdmsf_sec0 =arg2; + msf.cdmsf_frame0=arg3; + if (msf.cdmsf_sec0 > 59) msf.cdmsf_sec0 =59; + if (msf.cdmsf_frame0> 74) msf.cdmsf_frame0=74; + msf.cdmsf_min1=60; + msf.cdmsf_sec1=00; + msf.cdmsf_frame1=00; + if (ioctl(handle,cmd,&msf)) + { printf("Drive error or invalid address\n"); + } + break; +#ifdef AZT_PRIVATE_IOCTLS /*not supported by every CDROM driver*/ + case 'd': cmd=CDROMREADCOOKED; + printf("Address (min:sec:frame) "); + scanf("%d:%d:%d",&arg1,&arg2,&arg3); + azt.msf.cdmsf_min0 =arg1; + azt.msf.cdmsf_sec0 =arg2; + azt.msf.cdmsf_frame0=arg3; + if (azt.msf.cdmsf_sec0 > 59) azt.msf.cdmsf_sec0 =59; + if (azt.msf.cdmsf_frame0> 74) azt.msf.cdmsf_frame0=74; + if (ioctl(handle,cmd,&azt.msf)) + { printf("Drive error, invalid address or unsupported command\n"); + } + k=0; + getchar(); + for (i=0;i<128;i++) + { printf("%4d:",i*16); + for (j=0;j<16;j++) + { printf("%2x ",azt.buf[i*16+j]); + } + for (j=0;j<16;j++) + { if (isalnum(azt.buf[i*16+j])) + printf("%c",azt.buf[i*16+j]); + else + printf("."); + } + printf("\n"); + k++; + if (k>=20) + { printf("press ENTER to continue\n"); + getchar(); + k=0; + } + } + break; + case 'w':
cmd=CDROMREADRAW; + printf("Address (min:sec:frame) "); + scanf("%d:%d:%d",&arg1,&arg2,&arg3); + azt.msf.cdmsf_min0 =arg1; + azt.msf.cdmsf_sec0 =arg2; + azt.msf.cdmsf_frame0=arg3; + if (azt.msf.cdmsf_sec0 > 59) azt.msf.cdmsf_sec0 =59; + if (azt.msf.cdmsf_frame0> 74) azt.msf.cdmsf_frame0=74; + if (ioctl(handle,cmd,&azt)) + { printf("Drive error, invalid address or unsupported command\n"); + } + k=0; + for (i=0;i<147;i++) + { printf("%4d:",i*16); + for (j=0;j<16;j++) + { printf("%2x ",azt.buf[i*16+j]); + } + for (j=0;j<16;j++) + { if (isalnum(azt.buf[i*16+j])) + printf("%c",azt.buf[i*16+j]); + else + printf("."); + } + printf("\n"); + k++; + if (k>=20) + { getchar(); + k=0; + } + } + break; +#endif + case 'v': cmd=CDROMVOLCTRL; + printf("--Channel 0 Left (0-255): "); + scanf("%d",&arg1); + printf("--Channel 1 Right (0-255): "); + scanf("%d",&arg2); + volctrl.channel0=arg1; + volctrl.channel1=arg2; + volctrl.channel2=0; + volctrl.channel3=0; + if (ioctl(handle,cmd,&volctrl)) + { printf("Drive error or unsupported command\n"); + } + break; + case 'q': if (close(handle)) printf("Drive Error: CLOSE\n"); + exit(0); + case 'h': help(); + break; + default: printf("unknown command\n"); + break; + } + } + } + return 0; +} diff --git a/Documentation/cdrom/cdrom-standard.tex b/Documentation/cdrom/cdrom-standard.tex new file mode 100644 index 000000000000..92f94e597582 --- /dev/null +++ b/Documentation/cdrom/cdrom-standard.tex @@ -0,0 +1,1022 @@ +\documentclass{article} +\def\version{$Id: cdrom-standard.tex,v 1.9 1997/12/28 15:42:49 david Exp $} +\newcommand{\newsection}[1]{\newpage\section{#1}} + +\evensidemargin=0pt +\oddsidemargin=0pt +\topmargin=-\headheight \advance\topmargin by -\headsep +\textwidth=15.99cm \textheight=24.62cm % normal A4, 1'' margin + +\def\linux{{\sc Linux}} +\def\cdrom{{\sc cd-rom}} +\def\UCD{{\sc Uniform cd-rom Driver}} +\def\cdromc{{\tt {cdrom.c}}} +\def\cdromh{{\tt {cdrom.h}}} +\def\fo{\sl} % foreign words +\def\ie{{\fo i.e.}} +\def\eg{{\fo e.g.}} + 
+\everymath{\it} \everydisplay{\it} +\catcode `\_=\active \def_{\_\penalty100 } +\catcode`\<=\active \def<#1>{{\langle\hbox{\rm#1}\rangle}} + +\begin{document} +\title{A \linux\ \cdrom\ standard} +\author{David van Leeuwen\\{\normalsize\tt david@ElseWare.cistron.nl} +\\{\footnotesize updated by Erik Andersen {\tt(andersee@debian.org)}} +\\{\footnotesize updated by Jens Axboe {\tt(axboe@image.dk)}}} +\date{12 March 1999} + +\maketitle + +\newsection{Introduction} + +\linux\ is probably the Unix-like operating system that supports +the widest variety of hardware devices. The reasons for this are +presumably +\begin{itemize} +\item + The large list of hardware devices available for the many platforms + that \linux\ now supports (\ie, i386-PCs, Sparc Suns, etc.) +\item + The open design of the operating system, such that anybody can write a + driver for \linux. +\item + There is plenty of source code around as examples of how to write a driver. +\end{itemize} +The openness of \linux, and the many different types of available +hardware has allowed \linux\ to support many different hardware devices. +Unfortunately, the very openness that has allowed \linux\ to support +all these different devices has also allowed the behavior of each +device driver to differ significantly from one device to another. +This divergence of behavior has been very significant for \cdrom\ +devices; the way a particular drive reacts to a `standard' $ioctl()$ +call varies greatly from one device driver to another. To avoid making +their drivers totally inconsistent, the writers of \linux\ \cdrom\ +drivers generally created new device drivers by understanding, copying, +and then changing an existing one. Unfortunately, this practice did not +maintain uniform behavior across all the \linux\ \cdrom\ drivers. + +This document describes an effort to establish Uniform behavior across +all the different \cdrom\ device drivers for \linux. 
This document also +defines the various $ioctl$s, and how the low-level \cdrom\ device +drivers should implement them. Currently (as of the \linux\ 2.1.$x$ +development kernels) several low-level \cdrom\ device drivers, including +both IDE/ATAPI and SCSI, now use this Uniform interface. + +When the \cdrom\ was developed, the interface between the \cdrom\ drive +and the computer was not specified in the standards. As a result, many +different \cdrom\ interfaces were developed. Some of them had their +own proprietary design (Sony, Mitsumi, Panasonic, Philips), other +manufacturers adopted an existing electrical interface and changed +the functionality (CreativeLabs/SoundBlaster, Teac, Funai) or simply +adapted their drives to one or more of the already existing electrical +interfaces (Aztech, Sanyo, Funai, Vertos, Longshine, Optics Storage and +most of the `NoName' manufacturers). In cases where a new drive really +brought its own interface or used its own command set and flow control +scheme, either a separate driver had to be written, or an existing +driver had to be enhanced. History has delivered us \cdrom\ support for +many of these different interfaces. Nowadays, almost all new \cdrom\ +drives are either IDE/ATAPI or SCSI, and it is very unlikely that any +manufacturer will create a new interface. Even finding drives for the +old proprietary interfaces is getting difficult. + +When (in the 1.3.70's) I looked at the existing software interface, +which was expressed through \cdromh, it appeared to be a rather wild +set of commands and data formats.\footnote{I cannot recollect what +kernel version I looked at, then, presumably 1.2.13 and 1.3.34---the +latest kernel that I was indirectly involved in.} It seemed that many +features of the software interface had been added to accommodate the +capabilities of a particular drive, in an {\fo ad hoc\/} manner. 
More +importantly, it appeared that the behavior of the `standard' commands +was different for most of the different drivers: \eg, some drivers +close the tray if an $open()$ call occurs when the tray is open, while +others do not. Some drivers lock the door upon opening the device, to +prevent an incoherent file system, but others don't, to allow software +ejection. Undoubtedly, the capabilities of the different drives vary, +but even when two drives have the same capability their drivers' +behavior was usually different. + +I decided to start a discussion on how to make all the \linux\ \cdrom\ +drivers behave more uniformly. I began by contacting the developers of +the many \cdrom\ drivers found in the \linux\ kernel. Their reactions +encouraged me to write the \UCD\ which this document is intended to +describe. The implementation of the \UCD\ is in the file \cdromc. This +driver is intended to be an additional software layer that sits on top +of the low-level device drivers for each \cdrom\ drive. By adding this +additional layer, it is possible to have all the different \cdrom\ +devices behave {\em exactly\/} the same (insofar as the underlying +hardware will allow). + +The goal of the \UCD\ is {\em not\/} to alienate driver developers who +have not yet taken steps to support this effort. The goal of \UCD\ is +simply to give people writing application programs for \cdrom\ drives +{\em one\/} \linux\ \cdrom\ interface with consistent behavior for all +\cdrom\ devices. In addition, this also provides a consistent interface +between the low-level device driver code and the \linux\ kernel. Care +is taken that 100\,\% compatibility exists with the data structures and +programmer's interface defined in \cdromh. This guide was written to +help \cdrom\ driver developers adapt their code to use the \UCD\ code +defined in \cdromc. 
+ +Personally, I think that the most important hardware interfaces are +the IDE/ATAPI drives and, of course, the SCSI drives, but as prices +of hardware drop continuously, it is also likely that people may have +more than one \cdrom\ drive, possibly of mixed types. It is important +that these drives behave in the same way. In December 1994, one of the +cheapest \cdrom\ drives was a Philips cm206, a double-speed proprietary +drive. In the months that I was busy writing a \linux\ driver for it, +proprietary drives became obsolete and IDE/ATAPI drives became the +standard. At the time of the last update to this document (November +1997) it is becoming difficult to even {\em find} anything less than a +16 speed \cdrom\ drive, and 24 speed drives are common. + +\newsection{Standardizing through another software level} +\label{cdrom.c} + +At the time this document was conceived, all drivers directly +implemented the \cdrom\ $ioctl()$ calls through their own routines. This +led to the danger of different drivers forgetting to do important things +like checking that the user was giving the driver valid data. More +importantly, this led to the divergence of behavior, which has already +been discussed. + +For this reason, the \UCD\ was created to enforce consistent \cdrom\ +drive behavior, and to provide a common set of services to the various +low-level \cdrom\ device drivers. The \UCD\ now provides another +software level that separates the $ioctl()$ and $open()$ implementation +from the actual hardware implementation. Note that this effort has +made few changes which will affect a user's application programs. The +greatest change involved moving the contents of the various low-level +\cdrom\ drivers' header files to the kernel's cdrom directory. This was +done to help ensure that the user is presented with only one cdrom +interface, the interface defined in \cdromh.
+ +\cdrom\ drives are specific enough (\ie, different from other +block-devices such as floppy or hard disc drives), to define a set +of common {\em \cdrom\ device operations}, $<cdrom-device>_dops$. +These operations are different from the classical block-device file +operations, $<block-device>_fops$. + +The routines for the \UCD\ interface level are implemented in the file +\cdromc. In this file, the \UCD\ interfaces with the kernel as a block +device by registering the following general $struct\ file_operations$: +$$ +\halign{$#$\ \hfil&$#$\ \hfil&$/*$ \rm# $*/$\hfil\cr +struct& file_operations\ cdrom_fops = \{\hidewidth\cr + &NULL, & lseek \cr + &block_read, & read---general block-dev read \cr + &block_write, & write---general block-dev write \cr + &NULL, & readdir \cr + &NULL, & select \cr + &cdrom_ioctl, & ioctl \cr + &NULL, & mmap \cr + &cdrom_open, & open \cr + &cdrom_release, & release \cr + &NULL, & fsync \cr + &NULL, & fasync \cr + &cdrom_media_changed, & media change \cr + &NULL & revalidate \cr +\};\cr +} +$$ + +Every active \cdrom\ device shares this $struct$. The routines +declared above are all implemented in \cdromc, since this file is the +place where the behavior of all \cdrom-devices is defined and +standardized. The actual interface to the various types of \cdrom\ +hardware is still performed by various low-level \cdrom-device +drivers. These routines simply implement certain {\em capabilities\/} +that are common to all \cdrom\ (and really, all removable-media) +devices. + +Registration of a low-level \cdrom\ device driver is now done through +the general routines in \cdromc, not through the Virtual File System +(VFS) any more. The interface implemented in \cdromc\ is carried out +through two general structures that contain information about the +capabilities of the driver, and the specific drives on which the +driver operates.
The structures are: +\begin{description} +\item[$cdrom_device_ops$] + This structure contains information about the low-level driver for a + \cdrom\ device. This structure is conceptually connected to the major + number of the device (although some drivers may have different + major numbers, as is the case for the IDE driver). +\item[$cdrom_device_info$] + This structure contains information about a particular \cdrom\ drive, + such as its device name, speed, etc. This structure is conceptually + connected to the minor number of the device. +\end{description} + +Registering a particular \cdrom\ drive with the \UCD\ is done by the +low-level device driver through a call to: +$$register_cdrom(struct\ cdrom_device_info * <device>_info) +$$ +The device information structure, $<device>_info$, contains all the +information needed for the kernel to interface with the low-level +\cdrom\ device driver. One of the most important entries in this +structure is a pointer to the $cdrom_device_ops$ structure of the +low-level driver. + +The device operations structure, $cdrom_device_ops$, contains a list +of pointers to the functions which are implemented in the low-level +device driver. When \cdromc\ accesses a \cdrom\ device, it does so +through the functions in this structure. It is impossible to know all +the capabilities of future \cdrom\ drives, so it is expected that this +list may need to be expanded from time to time as new technologies are +developed. For example, CD-R and CD-R/W drives are beginning to become +popular, and support will soon need to be added for them.
For now, the +current $struct$ is: +$$ +\halign{$#$\ \hfil&$#$\ \hfil&\hbox to 10em{$#$\hss}& + $/*$ \rm# $*/$\hfil\cr +struct& cdrom_device_ops\ \{ \hidewidth\cr + &int& (* open)(struct\ cdrom_device_info *, int);\cr + &void& (* release)(struct\ cdrom_device_info *);\cr + &int& (* drive_status)(struct\ cdrom_device_info *, int);\cr + &int& (* media_changed)(struct\ cdrom_device_info *, int);\cr + &int& (* tray_move)(struct\ cdrom_device_info *, int);\cr + &int& (* lock_door)(struct\ cdrom_device_info *, int);\cr + &int& (* select_speed)(struct\ cdrom_device_info *, int);\cr + &int& (* select_disc)(struct\ cdrom_device_info *, int);\cr + &int& (* get_last_session) (struct\ cdrom_device_info *, + struct\ cdrom_multisession *{});\cr + &int& (* get_mcn)(struct\ cdrom_device_info *, struct\ cdrom_mcn *{});\cr + &int& (* reset)(struct\ cdrom_device_info *);\cr + &int& (* audio_ioctl)(struct\ cdrom_device_info *, unsigned\ int, + void *{});\cr + &int& (* dev_ioctl)(struct\ cdrom_device_info *, unsigned\ int, + unsigned\ long);\cr +\noalign{\medskip} + &const\ int& capability;& capability flags \cr + &int& n_minors;& number of active minor devices \cr +\};\cr +} +$$ +When a low-level device driver implements one of these capabilities, +it should add a function pointer to this $struct$. When a particular +function is not implemented, however, this $struct$ should contain a +NULL instead. The $capability$ flags specify the capabilities of the +\cdrom\ hardware and/or low-level \cdrom\ driver when a \cdrom\ drive +is registered with the \UCD. The value $n_minors$ should be a positive +value indicating the number of minor devices that are supported by +the low-level device driver, normally~1. Although these two variables +are `informative' rather than `operational,' they are included in +$cdrom_device_ops$ because they describe the capability of the {\em +driver\/} rather than the {\em drive}. Nomenclature has always been +difficult in computer programming.
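The rule that an unimplemented operation is left as NULL in the operations structure can be illustrated with a small user-space sketch. This is not the kernel code: the types and names prefixed with sketch_ and demo_, the flag value, and the choice of returning -ENOSYS for a missing function are all assumptions made only for this illustration, and the structure is cut down to four operations.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified user-space stand-ins; NOT the kernel's real definitions. */
struct sketch_device_info;

struct sketch_device_ops {
    int  (*open)(struct sketch_device_info *, int);
    void (*release)(struct sketch_device_info *);
    int  (*tray_move)(struct sketch_device_info *, int);
    int  (*lock_door)(struct sketch_device_info *, int);
    const int capability;           /* capability flags */
    int n_minors;                   /* number of active minor devices */
};

struct sketch_device_info {
    struct sketch_device_ops *ops;  /* device operations for this major */
    int tray_is_open;               /* hypothetical per-drive state */
};

#define SKETCH_CAP_TRAY 0x1         /* hypothetical flag value */

static int demo_open(struct sketch_device_info *cdi, int purpose)
{
    (void)cdi;
    (void)purpose;
    return 0;
}

static void demo_release(struct sketch_device_info *cdi)
{
    (void)cdi;
}

static int demo_tray_move(struct sketch_device_info *cdi, int position)
{
    cdi->tray_is_open = position;   /* 0 = close tray, 1 = open tray */
    return 0;
}

/* A driver that can move its tray but cannot lock its door. */
struct sketch_device_ops demo_ops = {
    .open       = demo_open,
    .release    = demo_release,
    .tray_move  = demo_tray_move,
    .lock_door  = NULL,             /* not implemented by this driver */
    .capability = SKETCH_CAP_TRAY,
    .n_minors   = 1,
};

struct sketch_device_info demo_drive = { .ops = &demo_ops, .tray_is_open = 0 };

/* Generic layer: call through the table, or report "not implemented". */
int sketch_tray_move(struct sketch_device_info *cdi, int position)
{
    assert(cdi != NULL && cdi->ops != NULL);
    if (cdi->ops->tray_move == NULL)
        return -ENOSYS;
    return cdi->ops->tray_move(cdi, position);
}

int sketch_lock_door(struct sketch_device_info *cdi, int lock)
{
    assert(cdi != NULL && cdi->ops != NULL);
    if (cdi->ops->lock_door == NULL)
        return -ENOSYS;
    return cdi->ops->lock_door(cdi, lock);
}
```

With this layout the generic layer never needs to know which driver it is talking to; it only checks whether the pointer is present before calling through it.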
+ +Note that most functions have fewer parameters than their +$blkdev_fops$ counterparts. This is because very little of the +information in the structures $inode$ and $file$ is used. For most +drivers, the main parameter is the $struct$ $cdrom_device_info$, from +which the major and minor number can be extracted. (Most low-level +\cdrom\ drivers don't even look at the major and minor number though, +since many of them only support one device.) This will be available +through $dev$ in $cdrom_device_info$ described below. + +The drive-specific, minor-like information that is registered with +\cdromc, currently contains the following fields: +$$ +\halign{$#$\ \hfil&$#$\ \hfil&\hbox to 10em{$#$\hss}& + $/*$ \rm# $*/$\hfil\cr +struct& cdrom_device_info\ \{ \hidewidth\cr + & struct\ cdrom_device_ops *& ops;& device operations for this major\cr + & struct\ cdrom_device_info *& next;& next device_info for this major\cr + & void *& handle;& driver-dependent data\cr +\noalign{\medskip} + & kdev_t& dev;& device number (incorporates minor)\cr + & int& mask;& mask of capability: disables them \cr + & int& speed;& maximum speed for reading data \cr + & int& capacity;& number of discs in a jukebox \cr +\noalign{\medskip} + &int& options : 30;& options flags \cr + &unsigned& mc_flags : 2;& media-change buffer flags \cr + & int& use_count;& number of times device is opened\cr + & char& name[20];& name of the device type\cr +\}\cr +}$$ +Using this $struct$, a linked list of the registered minor devices is +built, using the $next$ field. The device number, the device operations +struct and specifications of properties of the drive are stored in this +structure. + +The $mask$ flags can be used to mask out some of the capabilities listed +in $ops\to capability$, if a specific drive doesn't support a feature +of the driver. The value $speed$ specifies the maximum head-rate of the +drive, measured in units of normal audio speed (176\,kB/sec raw data or +150\,kB/sec file system data). 
The value $capacity$ should reflect the +number of discs the drive can hold simultaneously, if it is designed +as a juke-box, or otherwise~1. The parameters are declared $const$ +because they describe properties of the drive, which don't change after +registration. + +A few registers contain variables local to the \cdrom\ drive. The +flags $options$ are used to specify how the general \cdrom\ routines +should behave. These various flags registers should provide enough +flexibility to adapt to the different users' wishes (and {\em not\/} the +`arbitrary' wishes of the author of the low-level device driver, as is +the case in the old scheme). The register $mc_flags$ is used to buffer +the information from $media_changed()$ to two separate queues. Other +data that is specific to a minor drive can be accessed through $handle$, +which can point to a data structure specific to the low-level driver. +The fields $use_count$, $next$, $options$ and $mc_flags$ need not be +initialized. + +The intermediate software layer that \cdromc\ forms will perform some +additional bookkeeping. The use count of the device (the number of +processes that have the device opened) is registered in $use_count$. The +function $cdrom_ioctl()$ will verify the appropriate user-memory regions +for read and write, and in case a location on the CD is transferred, +it will `sanitize' the format by making requests to the low-level +drivers in a standard format, and translating all formats between the +user-software and low level drivers. This relieves the drivers of much +of the memory checking, format checking and translation. Also, the necessary +structures will be declared on the program stack. + +The implementation of the functions should be as defined in the +following sections. Two functions {\em must\/} be implemented, namely +$open()$ and $release()$. Other functions may be omitted; their +corresponding capability flags will be cleared upon registration.
+Generally, a function returns zero on success and negative on error. A +function call should return only after the command has completed, but of +course waiting for the device should not use processor time. + +\subsection{$Int\ open(struct\ cdrom_device_info * cdi, int\ purpose)$} + +$Open()$ should try to open the device for a specific $purpose$, which +can be either: +\begin{itemize} +\item[0] Open for reading data, as done by {\tt {mount()}} (2), or the +user commands {\tt {dd}} or {\tt {cat}}. +\item[1] Open for $ioctl$ commands, as done by audio-CD playing +programs. +\end{itemize} +Notice that any strategic code (closing tray upon $open()$, etc.)\ is +done by the calling routine in \cdromc, so the low-level routine +should only be concerned with proper initialization, such as spinning +up the disc, etc. % and device-use count + + +\subsection{$Void\ release(struct\ cdrom_device_info * cdi)$} + + +Device-specific actions should be taken such as spinning down the device. +However, strategic actions such as ejection of the tray, or unlocking +the door, should be left over to the general routine $cdrom_release()$. +This is the only function returning type $void$. + +\subsection{$Int\ drive_status(struct\ cdrom_device_info * cdi, int\ slot_nr)$} +\label{drive status} + +The function $drive_status$, if implemented, should provide +information on the status of the drive (not the status of the disc, +which may or may not be in the drive). If the drive is not a changer, +$slot_nr$ should be ignored. 
In \cdromh\ the possibilities are listed: +$$ +\halign{$#$\ \hfil&$/*$ \rm# $*/$\hfil\cr +CDS_NO_INFO& no information available\cr +CDS_NO_DISC& no disc is inserted, tray is closed\cr +CDS_TRAY_OPEN& tray is opened\cr +CDS_DRIVE_NOT_READY& something is wrong, tray is moving?\cr +CDS_DISC_OK& a disc is loaded and everything is fine\cr +} +$$ + +\subsection{$Int\ media_changed(struct\ cdrom_device_info * cdi, int\ disc_nr)$} + +This function is very similar to the original function in $struct\ +file_operations$. It returns 1 if the medium of the device $cdi\to +dev$ has changed since the last call, and 0 otherwise. The parameter +$disc_nr$ identifies a specific slot in a juke-box, it should be +ignored for single-disc drives. Note that by `re-routing' this +function through $cdrom_media_changed()$, we can implement separate +queues for the VFS and a new $ioctl()$ function that can report device +changes to software (\eg, an auto-mounting daemon). + +\subsection{$Int\ tray_move(struct\ cdrom_device_info * cdi, int\ position)$} + +This function, if implemented, should control the tray movement. (No +other function should control this.) The parameter $position$ controls +the desired direction of movement: +\begin{itemize} +\item[0] Close tray +\item[1] Open tray +\end{itemize} +This function returns 0 upon success, and a non-zero value upon +error. Note that if the tray is already in the desired position, no +action need be taken, and the return value should be 0. + +\subsection{$Int\ lock_door(struct\ cdrom_device_info * cdi, int\ lock)$} + +This function (and no other code) controls locking of the door, if the +drive allows this. The value of $lock$ controls the desired locking +state: +\begin{itemize} +\item[0] Unlock door, manual opening is allowed +\item[1] Lock door, tray cannot be ejected manually +\end{itemize} +This function returns 0 upon success, and a non-zero value upon +error. 
Note that if the door is already in the requested state, no +action need be taken, and the return value should be 0. + +\subsection{$Int\ select_speed(struct\ cdrom_device_info * cdi, int\ speed)$} + +Some \cdrom\ drives are capable of changing their head-speed. There +are several reasons for changing the speed of a \cdrom\ drive. Badly +pressed \cdrom s may benefit from less-than-maximum head rate. Modern +\cdrom\ drives can obtain very high head rates (up to $24\times$ is +common). It has been reported that these drives can make reading +errors at these high speeds, reducing the speed can prevent data loss +in these circumstances. Finally, some of these drives can +make an annoyingly loud noise, which a lower speed may reduce. %Finally, +%although the audio-low-pass filters probably aren't designed for it, +%more than real-time playback of audio might be used for high-speed +%copying of audio tracks. + +This function specifies the speed at which data is read or audio is +played back. The value of $speed$ specifies the head-speed of the +drive, measured in units of standard cdrom speed (176\,kB/sec raw data +or 150\,kB/sec file system data). So to request that a \cdrom\ drive +operate at 300\,kB/sec you would call the CDROM_SELECT_SPEED $ioctl$ +with $speed=2$. The special value `0' means `auto-selection', \ie, +maximum data-rate or real-time audio rate. If the drive doesn't have +this `auto-selection' capability, the decision should be made on the +current disc loaded and the return value should be positive. A negative +return value indicates an error. + +\subsection{$Int\ select_disc(struct\ cdrom_device_info * cdi, int\ number)$} + +If the drive can store multiple discs (a juke-box) this function +will perform disc selection. It should return the number of the +selected disc on success, a negative value on error. Currently, only +the ide-cd driver supports this functionality. 
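The speed units used by select_speed() can be made concrete with a little arithmetic. The sketch below only restates the figures given in the text (150 kB/sec of file system data or 176 kB/sec of raw data at single speed); the function and constant names are hypothetical and do not appear in the kernel.

```c
#include <assert.h>

enum {
    SKETCH_FS_KB_PER_SEC  = 150,    /* file system data rate at 1x */
    SKETCH_RAW_KB_PER_SEC = 176     /* raw data rate at 1x */
};

/* File system data rate in kB/sec for a speed multiplier >= 1.
 * The special value 0 means auto-selection and has no fixed rate. */
int sketch_fs_rate(int speed)
{
    assert(speed >= 1);
    return speed * SKETCH_FS_KB_PER_SEC;
}

/* Raw data rate in kB/sec for a speed multiplier >= 1. */
int sketch_raw_rate(int speed)
{
    assert(speed >= 1);
    return speed * SKETCH_RAW_KB_PER_SEC;
}

/* Largest multiplier whose file system rate does not exceed kb_per_sec. */
int sketch_speed_for_fs_rate(int kb_per_sec)
{
    assert(kb_per_sec >= SKETCH_FS_KB_PER_SEC);
    return kb_per_sec / SKETCH_FS_KB_PER_SEC;
}
```

For instance, a request for 300 kB/sec of file system data maps to the multiplier 2, which is the value passed in the example from the text.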
+ +\subsection{$Int\ get_last_session(struct\ cdrom_device_info * cdi, struct\ + cdrom_multisession * ms_info)$} + +This function should implement the old corresponding $ioctl()$. For +device $cdi\to dev$, the start of the last session of the current disc +should be returned in the pointer argument $ms_info$. Note that +routines in \cdromc\ have sanitized this argument: its requested +format will {\em always\/} be of the type $CDROM_LBA$ (linear block +addressing mode), whatever the calling software requested. But +sanitization goes even further: the low-level implementation may +return the requested information in $CDROM_MSF$ format if it wishes so +(setting the $ms_info\rightarrow addr_format$ field appropriately, of +course) and the routines in \cdromc\ will make the transformation if +necessary. The return value is 0 upon success. + +\subsection{$Int\ get_mcn(struct\ cdrom_device_info * cdi, struct\ + cdrom_mcn * mcn)$} + +Some discs carry a `Media Catalog Number' (MCN), also called +`Universal Product Code' (UPC). This number should reflect the number +that is generally found in the bar-code on the product. Unfortunately, +the few discs that carry such a number on the disc don't even use the +same format. The return argument to this function is a pointer to a +pre-declared memory region of type $struct\ cdrom_mcn$. The MCN is +expected as a 13-character string, terminated by a null-character. + +\subsection{$Int\ reset(struct\ cdrom_device_info * cdi)$} + +This call should perform a hard-reset on the drive (although in +circumstances that a hard-reset is necessary, a drive may very well not +listen to commands anymore). Preferably, control is returned to the +caller only after the drive has finished resetting. If the drive is no +longer listening, it may be wise for the underlying low-level cdrom +driver to time out. 
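The transformation between $CDROM_MSF$ and $CDROM_LBA$ mentioned for $get_last_session()$ can be sketched in C as follows. This is a minimal illustration, not the code in \cdromc: the function and macro names are hypothetical, and the sketch assumes the usual CD layout of 75 frames per second with a 150-frame (2-second) offset between MSF 00:02:00 and logical block 0.

```c
/* Assumed CD addressing constants (names are hypothetical). */
#define FRAMES_PER_SEC 75   /* frames per second of playing time */
#define SECS_PER_MIN   60
#define MSF_OFFSET     150  /* MSF 00:02:00 corresponds to LBA 0 */

/* Convert a Minute/Second/Frame address to a linear block address. */
int msf_to_lba(int minute, int second, int frame)
{
    return (minute * SECS_PER_MIN + second) * FRAMES_PER_SEC
           + frame - MSF_OFFSET;
}

/* Convert a linear block address back to Minute/Second/Frame.
 * The sketch assumes lba >= -MSF_OFFSET; smaller values are not handled. */
void lba_to_msf(int lba, int *minute, int *second, int *frame)
{
    lba += MSF_OFFSET;
    *minute = lba / (SECS_PER_MIN * FRAMES_PER_SEC);
    lba %= SECS_PER_MIN * FRAMES_PER_SEC;
    *second = lba / FRAMES_PER_SEC;
    *frame = lba % FRAMES_PER_SEC;
}
```

A low-level driver that prefers to answer in MSF format could then rely on a conversion of this kind being done on its behalf, which is the division of labour the text describes.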
+ +\subsection{$Int\ audio_ioctl(struct\ cdrom_device_info * cdi, unsigned\ + int\ cmd, void * arg)$} + +Some of the \cdrom-$ioctl$s defined in \cdromh\ can be +implemented by the routines described above, and hence the function +$cdrom_ioctl$ will use those. However, most $ioctl$s deal with +audio-control. We have decided to leave these to be accessed through a +single function, repeating the arguments $cmd$ and $arg$. Note that +the latter is of type $void*{}$, rather than $unsigned\ long\ +int$. The routine $cdrom_ioctl()$ does do some useful things, +though. It sanitizes the address format type to $CDROM_MSF$ (Minutes, +Seconds, Frames) for all audio calls. It also verifies the memory +location of $arg$, and reserves stack-memory for the argument. This +makes implementation of the $audio_ioctl()$ much simpler than in the +old driver scheme. For example, you may look up the function +$cm206_audio_ioctl()$ in {\tt {cm206.c}} that should be updated with +this documentation. + +An unimplemented ioctl should return $-ENOSYS$, but a harmless request +(\eg, $CDROMSTART$) may be ignored by returning 0 (success). Other +errors should be according to the standards, whatever they are. When +an error is returned by the low-level driver, the \UCD\ tries whenever +possible to return the error code to the calling program. (We may decide +to sanitize the return value in $cdrom_ioctl()$ though, in order to +guarantee a uniform interface to the audio-player software.) + +\subsection{$Int\ dev_ioctl(struct\ cdrom_device_info * cdi, unsigned\ int\ + cmd, unsigned\ long\ arg)$} + +Some $ioctl$s seem to be specific to certain \cdrom\ drives. That is, +they are introduced to service some capabilities of certain drives. In +fact, there are 6 different $ioctl$s for reading data, either in some +particular kind of format, or audio data. Not many drives support +reading audio tracks as data, I believe this is because of protection +of copyrights of artists. 
Moreover, I think that if audio-tracks are +supported, it should be done through the VFS and not via $ioctl$s. A +problem here could be the fact that audio-frames are 2352 bytes long, +so either the audio-file-system should ask for 75264 bytes at once +(the least common multiple of 512 and 2352), or the drivers should +bend their backs to cope with this incoherence (to which I would be +opposed). Furthermore, it is very difficult for the hardware to find +the exact frame boundaries, since there are no synchronization headers +in audio frames. Once these issues are resolved, this code should be +standardized in \cdromc. + +Because there are so many $ioctl$s that seem to be introduced to +satisfy certain drivers,\footnote{Is there software around that + actually uses these? I'd be interested!} any `non-standard' $ioctl$s +are routed through the call $dev_ioctl()$. In principle, `private' +$ioctl$s should be numbered after the device's major number, and not +the general \cdrom\ $ioctl$ number, {\tt {0x53}}. Currently the +non-supported $ioctl$s are: {\it CDROMREADMODE1, CDROMREADMODE2, + CDROMREADAUDIO, CDROMREADRAW, CDROMREADCOOKED, CDROMSEEK, + CDROMPLAY\-BLK and CDROM\-READALL}. + + +\subsection{\cdrom\ capabilities} +\label{capability} + +Instead of just implementing some $ioctl$ calls, the interface in +\cdromc\ supplies the possibility to indicate the {\em capabilities\/} +of a \cdrom\ drive. This can be done by ORing any number of +capability-constants that are defined in \cdromh\ at the registration +phase. 
Currently, the capabilities are any of: +$$ +\halign{$#$\ \hfil&$/*$ \rm# $*/$\hfil\cr +CDC_CLOSE_TRAY& can close tray by software control\cr +CDC_OPEN_TRAY& can open tray\cr +CDC_LOCK& can lock and unlock the door\cr +CDC_SELECT_SPEED& can select speed, in units of $\sim$150\,kB/s\cr +CDC_SELECT_DISC& drive is juke-box\cr +CDC_MULTI_SESSION& can read sessions $>\rm1$\cr +CDC_MCN& can read Media Catalog Number\cr +CDC_MEDIA_CHANGED& can report if disc has changed\cr +CDC_PLAY_AUDIO& can perform audio-functions (play, pause, etc)\cr +CDC_RESET& hard reset device\cr +CDC_IOCTLS& driver has non-standard ioctls\cr +CDC_DRIVE_STATUS& driver implements drive status\cr +} +$$ +The capability flag is declared $const$, to prevent drivers from +accidentally tampering with the contents. The capability flags actually +inform \cdromc\ of what the driver can do. If the drive found +by the driver does not have the capability, it can be masked out by +the $cdrom_device_info$ variable $mask$. For instance, the SCSI \cdrom\ +driver has implemented the code for loading and ejecting \cdrom's, and +hence its corresponding flags in $capability$ will be set. But a SCSI +\cdrom\ drive might be a caddy system, which can't load the tray, and +hence for this drive the $cdrom_device_info$ struct will have set +the $CDC_CLOSE_TRAY$ bit in $mask$. + +In the file \cdromc\ you will encounter many constructions of the type +$$\it +if\ (cdo\rightarrow capability \mathrel\& \mathord{\sim} cdi\rightarrow mask + \mathrel{\&} CDC_) \ldots +$$ +There is no $ioctl$ to set the mask\dots The reason is that +I think it is better to control the {\em behavior\/} rather than the +{\em capabilities}. + +\subsection{Options} + +A final flag register controls the {\em behavior\/} of the \cdrom\ +drives, in order to satisfy different users' wishes, hopefully +independently of the ideas of the respective author who happened to +have made the drive's support available to the \linux\ community.
The +current behavior options are: +$$ +\halign{$#$\ \hfil&$/*$ \rm# $*/$\hfil\cr +CDO_AUTO_CLOSE& try to close tray upon device $open()$\cr +CDO_AUTO_EJECT& try to open tray on last device $close()$\cr +CDO_USE_FFLAGS& use $file_pointer\rightarrow f_flags$ to indicate + purpose for $open()$\cr +CDO_LOCK& try to lock door if device is opened\cr +CDO_CHECK_TYPE& ensure disc type is data if opened for data\cr +} +$$ + +The initial value of this register is $CDO_AUTO_CLOSE \mathrel| +CDO_USE_FFLAGS \mathrel| CDO_LOCK$, reflecting my own view on user +interface and software standards. Before you protest, there are two +new $ioctl$s implemented in \cdromc, that allow you to control the +behavior by software. These are: +$$ +\halign{$#$\ \hfil&$/*$ \rm# $*/$\hfil\cr +CDROM_SET_OPTIONS& set options specified in $(int)\ arg$\cr +CDROM_CLEAR_OPTIONS& clear options specified in $(int)\ arg$\cr +} +$$ +One option needs some more explanation: $CDO_USE_FFLAGS$. In the next +newsection we explain what the need for this option is. + +A software package {\tt setcd}, available from the Debian distribution +and {\tt sunsite.unc.edu}, allows user level control of these flags. + +\newsection{The need to know the purpose of opening the \cdrom\ device} + +Traditionally, Unix devices can be used in two different `modes', +either by reading/writing to the device file, or by issuing +controlling commands to the device, by the device's $ioctl()$ +call. The problem with \cdrom\ drives, is that they can be used for +two entirely different purposes. One is to mount removable +file systems, \cdrom s, the other is to play audio CD's. Audio commands +are implemented entirely through $ioctl$s, presumably because the +first implementation (SUN?) has been such. In principle there is +nothing wrong with this, but a good control of the `CD player' demands +that the device can {\em always\/} be opened in order to give the +$ioctl$ commands, regardless of the state the drive is in. 
+ +On the other hand, when used as a removable-media disc drive (what the +original purpose of \cdrom s is) we would like to make sure that the +disc drive is ready for operation upon opening the device. In the old +scheme, some \cdrom\ drivers don't do any integrity checking, resulting +in a number of i/o errors reported by the VFS to the kernel when an +attempt for mounting a \cdrom\ on an empty drive occurs. This is not a +particularly elegant way to find out that there is no \cdrom\ inserted; +it more-or-less looks like the old IBM-PC trying to read an empty floppy +drive for a couple of seconds, after which the system complains it +can't read from it. Nowadays we can {\em sense\/} the existence of a +removable medium in a drive, and we believe we should exploit that +fact. An integrity check on opening of the device, that verifies the +availability of a \cdrom\ and its correct type (data), would be +desirable. + +These two ways of using a \cdrom\ drive, principally for data and +secondarily for playing audio discs, have different demands for the +behavior of the $open()$ call. Audio use simply wants to open the +device in order to get a file handle which is needed for issuing +$ioctl$ commands, while data use wants to open for correct and +reliable data transfer. The only way user programs can indicate what +their {\em purpose\/} of opening the device is, is through the $flags$ +parameter (see {\tt {open(2)}}). For \cdrom\ devices, these flags aren't +implemented (some drivers implement checking for write-related flags, +but this is not strictly necessary if the device file has correct +permission flags). Most option flags simply don't make sense to +\cdrom\ devices: $O_CREAT$, $O_NOCTTY$, $O_TRUNC$, $O_APPEND$, and +$O_SYNC$ have no meaning to a \cdrom. + +We therefore propose to use the flag $O_NONBLOCK$ to indicate +that the device is opened just for issuing $ioctl$ +commands. 
Strictly, the meaning of $O_NONBLOCK$ is that opening and +subsequent calls to the device don't cause the calling process to +wait. We could interpret this as ``don't wait until someone has +inserted some valid data-\cdrom.'' Thus, our proposal of the +implementation for the $open()$ call for \cdrom s is: +\begin{itemize} +\item If no other flags are set than $O_RDONLY$, the device is opened +for data transfer, and the return value will be 0 only upon successful +initialization of the transfer. The call may even induce some actions +on the \cdrom, such as closing the tray. +\item If the option flag $O_NONBLOCK$ is set, opening will always be +successful, unless the whole device doesn't exist. The drive will take +no actions whatsoever. +\end{itemize} + +\subsection{And what about standards?} + +You might hesitate to accept this proposal as it comes from the +\linux\ community, and not from some standardizing institute. What +about SUN, SGI, HP and all those other Unix and hardware vendors? +Well, these companies are in the lucky position that they generally +control both the hardware and software of their supported products, +and are large enough to set their own standard. They do not have to +deal with a dozen or more different, competing hardware +configurations.\footnote{Incidentally, I think that SUN's approach to +mounting \cdrom s is very good in origin: under Solaris a +volume-daemon automatically mounts a newly inserted \cdrom\ under {\tt +{/cdrom/$$/}}. In my opinion they should have pushed this +further and have {\em every\/} \cdrom\ on the local area network be +mounted at the similar location, \ie, no matter in which particular +machine you insert a \cdrom, it will always appear at the same +position in the directory tree, on every system. 
When I wanted to +implement such a user-program for \linux, I came across the +differences in behavior of the various drivers, and the need for an +$ioctl$ informing about media changes.} + +We believe that using $O_NONBLOCK$ to indicate that a device is being opened +for $ioctl$ commands only can be easily introduced in the \linux\ +community. All the CD-player authors will have to be informed, we can +even send in our own patches to the programs. The use of $O_NONBLOCK$ +has most likely no influence on the behavior of the CD-players on +other operating systems than \linux. Finally, a user can always revert +to old behavior by a call to $ioctl(file_descriptor, CDROM_CLEAR_OPTIONS, +CDO_USE_FFLAGS)$. + +\subsection{The preferred strategy of $open()$} + +The routines in \cdromc\ are designed in such a way that run-time +configuration of the behavior of \cdrom\ devices (of {\em any\/} type) +can be carried out, by the $CDROM_SET/CLEAR_OPTIONS$ $ioctls$. Thus, various +modes of operation can be set: +\begin{description} +\item[$CDO_AUTO_CLOSE \mathrel| CDO_USE_FFLAGS \mathrel| CDO_LOCK$] This +is the default setting. (With $CDO_CHECK_TYPE$ it will be better, in the +future.) If the device is not yet opened by any other process, and if +the device is being opened for data ($O_NONBLOCK$ is not set) and the +tray is found to be open, an attempt to close the tray is made. Then, +it is verified that a disc is in the drive and, if $CDO_CHECK_TYPE$ is +set, that it contains tracks of type `data mode 1.' Only if all tests +are passed is the return value zero. The door is locked to prevent file +system corruption. If the drive is opened for audio ($O_NONBLOCK$ is +set), no actions are taken and a value of 0 will be returned. +\item[$CDO_AUTO_CLOSE \mathrel| CDO_AUTO_EJECT \mathrel| CDO_LOCK$] This +mimics the behavior of the current sbpcd-driver. The option flags are +ignored, the tray is closed on the first open, if necessary. 
Similarly, +the tray is opened on the last release, \ie, if a \cdrom\ is unmounted, +it is automatically ejected, such that the user can replace it. +\end{description} +We hope that these options can convince everybody (both driver +maintainers and user program developers) to adopt the new \cdrom\ +driver scheme and option flag interpretation. + +\newsection{Description of routines in \cdromc} + +Only a few routines in \cdromc\ are exported to the drivers. In this +new section we will discuss these, as well as the functions that `take +over' the \cdrom\ interface to the kernel. The header file belonging +to \cdromc\ is called \cdromh. Formerly, some of the contents of this +file were placed in the file {\tt {ucdrom.h}}, but this file has now been +merged back into \cdromh. + +\subsection{$Struct\ file_operations\ cdrom_fops$} + +The contents of this structure were described in section~\ref{cdrom.c}. +A pointer to this structure is assigned to the $fops$ field +of the $struct\ gendisk$. + +\subsection{$Int\ register_cdrom( struct\ cdrom_device_info\ * cdi)$} + +This function is used in about the same way as one registers $cdrom_fops$ +with the kernel; the device operations and information structures, +as described in section~\ref{cdrom.c}, should be registered with the +\UCD: +$$ +register_cdrom(\&_info); +$$ +This function returns zero upon success, and non-zero upon +failure. The structure $_info$ should have a pointer to the +driver's $_dops$, as in +$$ +\vbox{\halign{&$#$\hfil\cr +struct\ &cdrom_device_info\ _info = \{\cr +& _dops;\cr +&\ldots\cr +\}\cr +}}$$ +Note that a driver must have one static structure, $_dops$, while +it may have as many structures $_info$ as there are minor devices +active. $Register_cdrom()$ builds a linked list from these. + +\subsection{$Int\ unregister_cdrom(struct\ cdrom_device_info * cdi)$} + +Unregistering device $cdi$ with minor number $MINOR(cdi\to dev)$ removes +the minor device from the list.
If it was the last registered minor for +the low-level driver, this disconnects the registered device-operation +routines from the \cdrom\ interface. This function returns zero upon +success, and non-zero upon failure. + +\subsection{$Int\ cdrom_open(struct\ inode * ip, struct\ file * fp)$} + +This function is not called directly by the low-level drivers; it is +listed in the standard $cdrom_fops$. If the VFS opens a file, this +function becomes active. A strategy is implemented in this routine, +taking care of all capabilities and options that are set in the +$cdrom_device_ops$ connected to the device. Then, the program flow is +transferred to the device-dependent $open()$ call. + +\subsection{$Void\ cdrom_release(struct\ inode *ip, struct\ file +*fp)$} + +This function implements the reverse-logic of $cdrom_open()$, and then +calls the device-dependent $release()$ routine. When the use-count has +reached 0, the allocated buffers are flushed by calls to $sync_dev(dev)$ +and $invalidate_buffers(dev)$. + + +\subsection{$Int\ cdrom_ioctl(struct\ inode *ip, struct\ file *fp, +unsigned\ int\ cmd, unsigned\ long\ arg)$} +\label{cdrom-ioctl} + +This function handles all the standard $ioctl$ requests for \cdrom\ +devices in a uniform way. The different calls fall into three +categories: $ioctl$s that can be directly implemented by device +operations, ones that are routed through the call $audio_ioctl()$, and +the remaining ones, that are presumably device-dependent. Generally, a +negative return value indicates an error. + +\subsubsection{Directly implemented $ioctl$s} +\label{ioctl-direct} + +The following `old' \cdrom-$ioctl$s are implemented by directly +calling device-operations in $cdrom_device_ops$, if implemented and +not masked: +\begin{description} +\item[CDROMMULTISESSION] Requests the last session on a \cdrom. +\item[CDROMEJECT] Open tray. +\item[CDROMCLOSETRAY] Close tray.
+\item[CDROMEJECT_SW] If $arg\not=0$, set behavior to auto-close (close +tray on first open) and auto-eject (eject on last release), otherwise +set behavior to non-moving on $open()$ and $release()$ calls. +\item[CDROM_GET_MCN] Get the Media Catalog Number from a CD. +\end{description} + +\subsubsection{$Ioctl$s routed through $audio_ioctl()$} +\label{ioctl-audio} + +The following $ioctl$s are all implemented through a call to +the $cdrom_device_ops$ function $audio_ioctl()$. Memory checks and +allocation are performed in $cdrom_ioctl()$, which also sanitizes the +address format ($CDROM_LBA$/$CDROM_MSF$). +\begin{description} +\item[CDROMSUBCHNL] Get sub-channel data in argument $arg$ of type $struct\ +cdrom_subchnl *{}$. +\item[CDROMREADTOCHDR] Read Table of Contents header, in $arg$ of type +$struct\ cdrom_tochdr *{}$. +\item[CDROMREADTOCENTRY] Read a Table of Contents entry in $arg$ and +specified by $arg$ of type $struct\ cdrom_tocentry *{}$. +\item[CDROMPLAYMSF] Play audio fragment specified in Minute, Second, +Frame format, delimited by $arg$ of type $struct\ cdrom_msf *{}$. +\item[CDROMPLAYTRKIND] Play audio fragment in track-index format +delimited by $arg$ of type $struct\ \penalty-1000 cdrom_ti *{}$. +\item[CDROMVOLCTRL] Set volume specified by $arg$ of type $struct\ +cdrom_volctrl *{}$. +\item[CDROMVOLREAD] Read volume into $arg$ of type $struct\ +cdrom_volctrl *{}$. +\item[CDROMSTART] Spin up disc. +\item[CDROMSTOP] Stop playback of audio fragment. +\item[CDROMPAUSE] Pause playback of audio fragment. +\item[CDROMRESUME] Resume playing. +\end{description} + +\subsubsection{New $ioctl$s in \cdromc} + +The following $ioctl$s have been introduced to allow user programs to +control the behavior of individual \cdrom\ devices. New $ioctl$ +commands can be identified by the underscores in their names. +\begin{description} +\item[CDROM_SET_OPTIONS] Set options specified by $arg$. Returns the +option flag register after modification.
Use $arg = \rm0$ for reading +the current flags. +\item[CDROM_CLEAR_OPTIONS] Clear options specified by $arg$. Returns + the option flag register after modification. +\item[CDROM_SELECT_SPEED] Select head-rate speed of disc specified + by $arg$ in units of standard cdrom speed (176\,kB/sec raw data or + 150\,kB/sec file system data). The value 0 means `auto-select', \ie, + play audio discs at real time and data discs at maximum speed. The value + $arg$ is checked against the maximum head rate of the drive found in the + $cdrom_dops$. +\item[CDROM_SELECT_DISC] Select disc numbered $arg$ from a juke-box. + First disc is numbered 0. The number $arg$ is checked against the + maximum number of discs in the juke-box found in the $cdrom_dops$. +\item[CDROM_MEDIA_CHANGED] Returns 1 if a disc has been changed since + the last call. Note that calls to $cdrom_media_changed$ by the VFS + are treated by an independent queue, so both mechanisms will detect + a media change once. For juke-boxes, an extra argument $arg$ + specifies the slot for which the information is given. The special + value $CDSL_CURRENT$ requests that information about the currently + selected slot be returned. +\item[CDROM_DRIVE_STATUS] Returns the status of the drive by a call to + $drive_status()$. Return values are defined in section~\ref{drive + status}. Note that this call doesn't return information on the + current playing activity of the drive; this can be polled through an + $ioctl$ call to $CDROMSUBCHNL$. For juke-boxes, an extra argument + $arg$ specifies the slot for which (possibly limited) information is + given. The special value $CDSL_CURRENT$ requests that information + about the currently selected slot be returned. +\item[CDROM_DISC_STATUS] Returns the type of the disc currently in the + drive. It should be viewed as a complement to $CDROM_DRIVE_STATUS$. + This $ioctl$ can provide \emph {some} information about the current + disc that is inserted in the drive.
This functionality used to be + implemented in the low level drivers, but is now carried out + entirely in \UCD. + + The history of development of the CD's use as a carrier medium for + various digital information has led to many different disc types. + This $ioctl$ is useful only in the case that CDs have \emph {only + one} type of data on them. While this is often the case, it is + also very common for CDs to have some tracks with data, and some + tracks with audio. Because this is an existing interface, rather + than fixing this interface by changing the assumptions it was made + under, thereby breaking all user applications that use this + function, the \UCD\ implements this $ioctl$ as follows: If the CD in + question has audio tracks on it, and it has absolutely no CD-I, XA, + or data tracks on it, it will be reported as $CDS_AUDIO$. If it has + both audio and data tracks, it will return $CDS_MIXED$. If there + are no audio tracks on the disc, and if the CD in question has any + CD-I tracks on it, it will be reported as $CDS_XA_2_2$. Failing + that, if the CD in question has any XA tracks on it, it will be + reported as $CDS_XA_2_1$. Finally, if the CD in question has any + data tracks on it, it will be reported as a data CD ($CDS_DATA_1$). + + This $ioctl$ can return: + $$ + \halign{$#$\ \hfil&$/*$ \rm# $*/$\hfil\cr + CDS_NO_INFO& no information available\cr + CDS_NO_DISC& no disc is inserted, or tray is opened\cr + CDS_AUDIO& Audio disc (2352 audio bytes/frame)\cr + CDS_DATA_1& data disc, mode 1 (2048 user bytes/frame)\cr + CDS_XA_2_1& mixed data (XA), mode 2, form 1 (2048 user bytes)\cr + CDS_XA_2_2& mixed data (XA), mode 2, form 2 (2324 user bytes)\cr + CDS_MIXED& mixed audio/data disc\cr + } + $$ + For some information concerning frame layout of the various disc + types, see a recent version of \cdromh. + +\item[CDROM_CHANGER_NSLOTS] Returns the number of slots in a + juke-box. +\item[CDROMRESET] Reset the drive.
+\item[CDROM_GET_CAPABILITY] Returns the $capability$ flags for the + drive. Refer to section \ref{capability} for more information on + these flags. +\item[CDROM_LOCKDOOR] Locks the door of the drive. $arg == \rm0$ + unlocks the door, any other value locks it. +\item[CDROM_DEBUG] Turns on debugging info. Only root is allowed + to do this. Same semantics as CDROM_LOCKDOOR. +\end{description} + +\subsubsection{Device dependent $ioctl$s} + +Finally, all other $ioctl$s are passed to the function $dev_ioctl()$, +if implemented. No memory allocation or verification is carried out. + +\newsection{How to update your driver} + +\begin{enumerate} +\item Make a backup of your current driver. +\item Get hold of the files \cdromc\ and \cdromh, they should be in + the directory tree that came with this documentation. +\item Make sure you include \cdromh. +\item Change the 3rd argument of $register_blkdev$ from +$\&_fops$ to $\&cdrom_fops$. +\item Just after that line, add the following to register with the \UCD: + $$register_cdrom(\&_info);$$ + Similarly, add a call to $unregister_cdrom()$ at the appropriate place. +\item Copy an example of the device-operations $struct$ to your + source, \eg, from {\tt {cm206.c}} $cm206_dops$, and change all + entries to names corresponding to your driver, or names you just + happen to like. If your driver doesn't support a certain function, + make the entry $NULL$. At the entry $capability$ you should list all + capabilities your driver currently supports. If your driver + has a capability that is not listed, please send me a message. +\item Copy the $cdrom_device_info$ declaration from the same example + driver, and modify the entries according to your needs. If your + driver dynamically determines the capabilities of the hardware, this + structure should also be declared dynamically. +\item Implement all functions in your $_dops$ structure, + according to prototypes listed in \cdromh, and specifications given + in section~\ref{cdrom.c}. 
Most likely you have already implemented + the code in a large part, and you will almost certainly need to adapt the + prototype and return values. +\item Rename your $_ioctl()$ function to $audio_ioctl$ and + change the prototype a little. Remove entries listed in the first + part in section~\ref{cdrom-ioctl}, if your code was OK, these are + just calls to the routines you adapted in the previous step. +\item You may remove all remaining memory checking code in the + $audio_ioctl()$ function that deals with audio commands (these are + listed in the second part of section~\ref{cdrom-ioctl}). There is no + need for memory allocation either, so most $case$s in the $switch$ + statement look similar to: + $$ + case\ CDROMREADTOCENTRY\colon get_toc_entry\bigl((struct\ + cdrom_tocentry *{})\ arg\bigr); + $$ +\item All remaining $ioctl$ cases must be moved to a separate + function, $_ioctl$, the device-dependent $ioctl$s. Note that + memory checking and allocation must be kept in this code! +\item Change the prototypes of $_open()$ and + $_release()$, and remove any strategic code (\ie, tray + movement, door locking, etc.). +\item Try to recompile the drivers. We advise you to use modules, both + for {\tt {cdrom.o}} and your driver, as debugging is much easier this + way. +\end{enumerate} + +\newsection{Thanks} + +Thanks to all the people involved. First, Erik Andersen, who has +taken over the torch in maintaining \cdromc\ and integrating much +\cdrom-related code in the 2.1-kernel. Thanks to Scott Snyder and +Gerd Knorr, who were the first to implement this interface for SCSI +and IDE-CD drivers and added many ideas for extension of the data +structures relative to kernel~2.0. Further thanks to Heiko Eissfeldt, +Thomas Quinot, Jon Tombs, Ken Pizzini, Eberhard M\"onkeberg and Andrew +Kroll, the \linux\ \cdrom\ device driver developers who were kind +enough to give suggestions and criticisms during the writing. 
Finally +of course, I want to thank Linus Torvalds for making this possible in +the first place. + +\vfill +$ \version\ $ +\eject +\end{document} diff --git a/Documentation/cdrom/cdu31a b/Documentation/cdrom/cdu31a new file mode 100644 index 000000000000..c0667da09c00 --- /dev/null +++ b/Documentation/cdrom/cdu31a @@ -0,0 +1,196 @@ + + CDU31A/CDU33A Driver Info + ------------------------- + +Information on the Sony CDU31A/CDU33A CDROM driver for the Linux +kernel. + + Corey Minyard (minyard@metronet.com) + + Colossians 3:17 + +Crude Table of Contents +----------------------- + + Setting Up the Hardware + Configuring the Kernel + Configuring as a Module + Driver Special Features + + +This device driver handles Sony CDU31A/CDU33A CDROM drives and +provides a complete block-level interface as well as an ioctl() +interface (as specified in include/linux/cdrom.h). With this +interface, CDROMs can be accessed, standard audio CDs can be played +back normally, and CD audio information can be read off the drive. + +Note that this will only work for CDU31A/CDU33A drives. Some vendors +market their drives as CDU31A compatible. They lie. Their drives are +really CDU31A hardware interface compatible (they can plug into the +same card). They are not software compatible. + +Setting Up the Hardware +----------------------- + +The CDU31A driver is unable to safely tell if an interface card is +present that it can use because the interface card does not announce +its presence in any way besides placing 4 I/O locations in memory. It +used to just probe memory and attempt commands, but Linus wisely asked +me to remove that because it could really screw up other hardware in +the system. + +Because of this, you must tell the kernel where the drive interface +is, what interrupts are used, and possibly if you are on a PAS-16 +soundcard. + +If you have the Sony CDU31A/CDU33A drive interface card, the following +diagram will help you set it up. If you have another card, you are on +your own.
You need to make sure that the I/O address and interrupt are +not used by another card in the system. You will need to know the I/O +address and interrupt you have set. Note that use of interrupts is +highly recommended if possible; it really cuts down on CPU usage. +Unfortunately, most soundcards do not support interrupts for their +CDROM interfaces. By default, the Sony interface card comes with +interrupts disabled. + + +----------+-----------------+----------------------+ + | JP1 | 34 Pin Conn | | + | JP2 +-----------------+ | + | JP3 | + | JP4 | + | +--+ + | | +-+ + | | | | External + | | | | Connector + | | | | + | | +-+ + | +--+ + | | + | +--------+ + | | + +------------------------------------------+ + + JP1 sets the Base Address, using the following settings: + + Address Pin 1 Pin 2 + ------- ----- ----- + 0x320 Short Short + 0x330 Short Open + 0x340 Open Short + 0x360 Open Open + + JP2 and JP3 configure the DMA channel; they must be set the same. + + DMA Pin 1 Pin 2 Pin 3 + --- ----- ----- ----- + 1 On Off On + 2 Off On Off + 3 Off Off On + + JP4 Configures the IRQ: + + IRQ Pin 1 Pin 2 Pin 3 Pin 4 + --- ----- ----- ----- ----- + 3 Off Off On Off + 4 Off Off* Off On + 5 On Off Off Off + 6 Off On Off Off + + The documentation states to set this for interrupt + 4, but I think that is a mistake. + +Note that if you have another interface card, you will need to look at +the documentation to find the I/O base address. This is specified to +the SLCD.SYS driver for DOS with the /B: parameter, so you can look at +your DOS driver setup to find the address, if necessary. + +Configuring the Kernel +---------------------- + +You must tell the kernel where the drive is at boot time. This can be +done at the Linux boot prompt, by using LILO, or by using Bootlin. +Note that this is no substitute for HOWTOs and LILO documentation; if +you are confused, please read those for info on bootline configuration +and LILO.
+ +At the Linux boot prompt, press the ALT key and add the following line +after the boot name (you can let the kernel boot, it will tell you the +default boot name while booting): + + cdu31a=<base address>,<interrupt>[,PAS] + +The base address needs to have "0x" in front of it, since it is in +hex. For instance, to configure a drive at address 0x320 on interrupt 5, +use the following: + + cdu31a=0x320,5 + +I use the following boot line: + + cdu31a=0x1f88,0,PAS + +because I have a PAS-16 which does not support interrupts for the +CDU31A interface. + +Adding this as an append line at the beginning of the /etc/lilo.conf +file will set it for lilo configurations. I have the following as the +first line in my lilo.conf file: + + append="cdu31a=0x1f88,0" + +I'm not sure how to set up Bootlin (I have never used it); if someone +would like to fill in this section, please do. + + +Configuring as a Module +----------------------- + +The driver supports loading as a module. However, you must specify +the base address and interrupt on the insmod command line. You can't +use modprobe to load it, since modprobe doesn't support setting +variables. + +Anyway, I use the following line to load my driver as a module: + + /sbin/insmod /lib/modules/`uname -r`/misc/cdu31a.o cdu31a_port=0x1f88 + +You can set the following variables in the driver: + + cdu31a_port=<I/O address> - Sets the base I/O. If hex, put 0x in + front of it. This must be specified. + + cdu31a_irq=<interrupt> - Sets the interrupt number. Leaving this + off will turn interrupts off. + + +Driver Special Features +----------------------- + +This section describes features beyond the normal audio and CD-ROM +functions of the drive. + +2048 byte buffer mode + +If a disk is mounted with -o block=2048, data is copied straight from +the drive data port to the buffer. Otherwise, the readahead buffer +must be involved to hold the other 1K of data when a 1K block +operation is done. Note that with 2048 byte blocks you cannot execute +files from the CD.
+ +XA compatibility + +The driver should support XA disks for both the CDU31A and CDU33A. It +does this transparently; the program using it doesn't need to set anything. + +Multi-Session + +A multi-session disk looks just like a normal disk to the user. Just +mount one normally, and all the data should be there. A special +thanks to Koen for help with this! + +Raw sector I/O + +Using the CDROMREADAUDIO ioctl, it is possible to read raw audio and data +tracks. Both operations return 2352 bytes per sector. On the data +tracks, the first 12 bytes are not returned by the drive and the value +of that data is indeterminate. diff --git a/Documentation/cdrom/cm206 b/Documentation/cdrom/cm206 new file mode 100644 index 000000000000..810368f4f7c4 --- /dev/null +++ b/Documentation/cdrom/cm206 @@ -0,0 +1,185 @@ +This is the readme file for the driver for the Philips/LMS cdrom drive +cm206 in combination with the cm260 host adapter card. + + (c) 1995 David A. van Leeuwen + +Changes since version 0.99 +-------------------------- +- Interfacing to the kernel is routed through an extra interface layer, + cdrom.c. This allows runtime-configurable `behavior' of the cdrom-drive, + independent of the driver. + +Features since version 0.33 +--------------------------- +- Full audio support, that is, workman, workbone and cdp all work + reasonably now. Reading TOC still takes some time. xmcd has been + reported to run successfully. +- Made auto-probe code a little better, I hope + +Features since version 0.28 +--------------------------- +- Full speed transfer rate (300 kB/s). +- Minimum kernel memory usage for buffering (less than 3 kB). +- Multisession support. +- Tray locking. +- Statistics of driver accessible to the user. +- Module support. +- Auto-probing of adapter card's base port and irq line, + also configurable at boot time or module load time. + + +Decide how you are going to use the driver.
There are two +options: + + (a) installing the driver as a resident part of the kernel + (b) compiling the driver as a loadable module + + Further, you must decide if you are going to specify the base port + address and the interrupt request line of the adapter card cm260 as + boot options for (a), module parameters for (b), use automatic + probing of these values, or hard-wire your adaptor card's settings + into the source code. If you don't care, you can choose + autoprobing, which is the default. In that case you can move on to + the next step. + +Compiling the kernel +-------------------- +1) move to /usr/src/linux and do a + + make config + + If you have chosen option (a), answer yes to CONFIG_CM206 and + CONFIG_ISO9660_FS. + + If you have chosen option (b), answer yes to CONFIG_MODVERSIONS + and no (!) to CONFIG_CM206 and CONFIG_ISO9660_FS. + +2) then do a + + make clean; make zImage; make modules + +3) do the usual things to install a new image (backup the old one, run + `rdev -R zImage 1', copy the new image in place, run lilo). Might + be `make zlilo'. + +Using the driver as a module +---------------------------- +If you will only occasionally use the cd-rom driver, you can choose +option (b), install as a loadable module. You may have to re-compile +the module when you upgrade the kernel to a new version. + +Since version 0.96, much of the functionality has been transferred to +a generic cdrom interface in the file cdrom.c. The module cm206.o +depends on cdrom.o. If the latter is not compiled into the kernel, +you must explicitly load it before cm206.o: + + insmod /usr/src/linux/modules/cdrom.o + +To install the module, you use the command, as root + + insmod /usr/src/linux/modules/cm206.o + +You can specify the base address on the command line as well as the irq +line to be used, e.g. 
+ + insmod /usr/src/linux/modules/cm206.o cm206=0x300,11 + +The order of base port and irq line doesn't matter; if you specify only +one, the other will have the value of the compiled-in default. You +may also have to install the file-system module `iso9660.o', if you +didn't compile that into the kernel. + + +Using the driver as part of the kernel +-------------------------------------- +If you have chosen option (a), you can specify the base-port +address and irq on the lilo boot command line, e.g.: + + LILO: linux cm206=0x340,11 + +This assumes that your linux kernel image keyword is `linux'. +If you specify either IRQ (3--11) or base port (0x300--0x370), +auto probing is turned off for both settings, thus setting the +other value to the compiled-in default. + +Note that you can also put these parameters in the lilo configuration file: + +# linux config +image = /vmlinuz + root = /dev/hda1 + label = Linux + append = "cm206=0x340,11" + read-only + + +If module parameters and LILO config options don't work +------------------------------------------------------- +If autoprobing does not work, you can hard-wire the default values +of the base port address (CM206_BASE) and interrupt request line +(CM206_IRQ) into the file /usr/src/linux/drivers/cdrom/cm206.h. Change +the defines of CM206_IRQ and CM206_BASE. + + +Mounting the cdrom +------------------ +1) Make sure that the right device is installed in /dev. + + mknod /dev/cm206cd b 32 0 + +2) Make sure there is a mount point, e.g., /cdrom + + mkdir /cdrom + +3) mount using a command like this (run as root): + + mount -rt iso9660 /dev/cm206cd /cdrom + +4) For user-mounts, add a line in /etc/fstab + + /dev/cm206cd /cdrom iso9660 ro,noauto,user + + This will allow users to give the commands + + mount /cdrom + umount /cdrom + +If things don't work +-------------------- + +- Try to do a `dmesg' to find out if the driver said anything about + what is going wrong during the initialization. 
+ +- Try to do a `dd if=/dev/cm206cd | od -tc | less' to read from the + CD. + +- Look in the /proc directory to see if `cm206' shows up under one of + `interrupts', `ioports', `devices' or `modules' (if applicable). + + +DISCLAIMER +---------- +I cannot guarantee that this driver works, or that the hardware will +not be harmed, although I consider it most unlikely. + +I hope that you'll find this driver in some way useful. + + David van Leeuwen + david@tm.tno.nl + +Note for Linux CDROM vendors +----------------------------- +You are encouraged to include this driver on your Linux CDROM. If +you do, you might consider sending me a free copy of that cd-rom. +You can contact me through my e-mail address, david@tm.tno.nl. +If this driver is compiled into a kernel to boot off a cdrom, +you should actually send me a free copy of that cd-rom. + +Copyright +--------- +The copyright of the cm206 driver for Linux is + + (c) 1995 David A. van Leeuwen + +The driver is released under the conditions of the GNU General Public +License, which can be found in the file COPYING in the root of this +source tree. diff --git a/Documentation/cdrom/gscd b/Documentation/cdrom/gscd new file mode 100644 index 000000000000..d01ca36b5c43 --- /dev/null +++ b/Documentation/cdrom/gscd @@ -0,0 +1,60 @@ + Goldstar R420 CD-Rom device driver README + +For all other kinds of information about the GoldStar R420 CDROM +and this Linux device driver, see the WWW page: + + http://linux.rz.fh-hannover.de/~raupach + + + If you are the editor of a Linux CD, you should + enable gscd.c within your boot floppy kernel. Please + send me one of your CDs for free. + + +This current driver version 0.4a only supports reading data from the disk. +Currently we have no audio and no multisession or XA support. +The polling interface is used, no DMA. + + +Sometimes the GoldStar R420 is sold in a 'Reveal Multimedia Kit'. This kit's +drive interface is compatible, too.
+ +Installation +------------ + +Change to '/usr/src/linux/drivers/cdrom' and edit the file 'gscd.h'. Insert +the i/o address of your interface card. + +The default base address is 0x340. This will work for most applications. +Address selection is accomplished by jumpers PN801-1 to PN801-4 on the +GoldStar Interface Card. +Appropriate settings are: 0x300, 0x310, 0x320, 0x330, 0x340, 0x350, 0x360, +0x370, 0x380, 0x390, 0x3A0, 0x3B0, 0x3C0, 0x3D0, 0x3E0, 0x3F0 + +Then go back to '/usr/src/linux/' and 'make config' to build the new +configuration for your kernel. If you want to use the GoldStar driver +as a module, don't select 'GoldStar CDROM support'. By the way, you +have to include the iso9660 filesystem. + +Now start compiling the kernel with 'make zImage'. +If you want to use the driver as a module, you have to do 'make modules' +and 'make modules_install', additionally. +Install your new kernel as usual - maybe you do it with 'make zlilo'. + +Before you can use the driver, you have to + mknod /dev/gscd0 b 16 0 +to create the appropriate device file (you only need to do this once). + +If you use modules, you can try to insert the driver. +Say: 'insmod /usr/src/linux/modules/gscd.o' +or: 'insmod /usr/src/linux/modules/gscd.o gscd=<address>
' +The driver should report its results. + +That's it! Mount a disk, e.g. 'mount -rt iso9660 /dev/gscd0 /cdrom' + +Feel free to report errors and suggestions to the following address. +Rest assured, I'm very happy to receive your comments! + + Oliver Raupach Hannover, June 1995 +(raupach@nwfs1.rz.fh-hannover.de) diff --git a/Documentation/cdrom/ide-cd b/Documentation/cdrom/ide-cd new file mode 100644 index 000000000000..29721bfcde12 --- /dev/null +++ b/Documentation/cdrom/ide-cd @@ -0,0 +1,574 @@ +IDE-CD driver documentation +Originally by scott snyder (19 May 1996) +Carrying on the torch is: Erik Andersen +New maintainers (19 Oct 1998): Jens Axboe + +1. Introduction +--------------- + +The ide-cd driver should work with all ATAPI ver 1.2 to ATAPI 2.6 compliant +CDROM drives which attach to an IDE interface. Note that some CDROM vendors +(including Mitsumi, Sony, Creative, Aztech, and Goldstar) have made +both ATAPI-compliant drives and drives which use a proprietary +interface. If your drive uses one of those proprietary interfaces, +this driver will not work with it (but one of the other CDROM drivers +probably will). This driver will not work with `ATAPI' drives which +attach to the parallel port. In addition, there is at least one drive +(CyCDROM CR520ie) which attaches to the IDE port but is not ATAPI; +this driver will not work with drives like that either (but see the +aztcd driver). + +This driver provides the following features: + + - Reading from data tracks, and mounting ISO 9660 filesystems. + + - Playing audio tracks. Most of the CDROM player programs floating + around should work; I usually use Workman. + + - Multisession support. + + - On drives which support it, reading digital audio data directly + from audio tracks. The program cdda2wav can be used for this. + Note, however, that only some drives actually support this. + + - There is now support for CDROM changers which comply with the + ATAPI 2.6 draft standard (such as the NEC CDR-251).
This additional + functionality includes a function call to query which slot is the + currently selected slot, a function call to query which slots contain + CDs, etc. A sample program which demonstrates this functionality is + appended to the end of this file. The Sanyo 3-disc changer + (which does not conform to the standard) is also now supported. + Please note the driver refers to the first CD as slot # 0. + + +2. Installation +--------------- + +0. The ide-cd relies on the ide disk driver. See + Documentation/ide.txt for up-to-date information on the ide + driver. + +1. Make sure that the ide and ide-cd drivers are compiled into the + kernel you're using. When configuring the kernel, in the section + entitled "Floppy, IDE, and other block devices", say either `Y' + (which will compile the support directly into the kernel) or `M' + (to compile support as a module which can be loaded and unloaded) + to the options: + + Enhanced IDE/MFM/RLL disk/cdrom/tape/floppy support + Include IDE/ATAPI CDROM support + + and `no' to + + Use old disk-only driver on primary interface + + Depending on what type of IDE interface you have, you may need to + specify additional configuration options. See + Documentation/ide.txt. + +2. You should also ensure that the iso9660 filesystem is either + compiled into the kernel or available as a loadable module. You + can see if a filesystem is known to the kernel by catting + /proc/filesystems. + +3. The CDROM drive should be connected to the host on an IDE + interface. Each interface on a system is defined by an I/O port + address and an IRQ number, the standard assignments being + 0x1f0 and 14 for the primary interface and 0x170 and 15 for the + secondary interface. Each interface can control up to two devices, + where each device can be a hard drive, a CDROM drive, a floppy drive, + or a tape drive. The two devices on an interface are called `master' + and `slave'; this is usually selectable via a jumper on the drive. 
+ + Linux names these devices as follows. The master and slave devices + on the primary IDE interface are called `hda' and `hdb', + respectively. The drives on the secondary interface are called + `hdc' and `hdd'. (Interfaces at other locations get other letters + in the third position; see Documentation/ide.txt.) + + If you want your CDROM drive to be found automatically by the + driver, you should make sure your IDE interface uses either the + primary or secondary addresses mentioned above. In addition, if + the CDROM drive is the only device on the IDE interface, it should + be jumpered as `master'. (If for some reason you cannot configure + your system in this manner, you can probably still use the driver. + You may have to pass extra configuration information to the kernel + when you boot, however. See Documentation/ide.txt for more + information.) + +4. Boot the system. If the drive is recognized, you should see a + message which looks like + + hdb: NEC CD-ROM DRIVE:260, ATAPI CDROM drive + + If you do not see this, see section 5 below. + +5. You may want to create a symbolic link /dev/cdrom pointing to the + actual device. You can do this with the command + + ln -s /dev/hdX /dev/cdrom + + where X should be replaced by the letter indicating where your + drive is installed. + +6. You should be able to see any error messages from the driver with + the `dmesg' command. + + +3. Basic usage +-------------- + +An ISO 9660 CDROM can be mounted by putting the disc in the drive and +typing (as root) + + mount -t iso9660 /dev/cdrom /mnt/cdrom + +where it is assumed that /dev/cdrom is a link pointing to the actual +device (as described in step 5 of the last section) and /mnt/cdrom is +an empty directory. You should now be able to see the contents of the +CDROM under the /mnt/cdrom directory. If you want to eject the CDROM, +you must first dismount it with a command like + + umount /mnt/cdrom + +Note that audio CDs cannot be mounted. 
+ +Some distributions set up /etc/fstab to always try to mount a CDROM +filesystem on bootup. It is not required to mount the CDROM in this +manner, though, and it may be a nuisance if you change CDROMs often. +You should feel free to remove the cdrom line from /etc/fstab and +mount CDROMs manually if that suits you better. + +Multisession and photocd discs should work with no special handling. +The hpcdtoppm package (ftp.gwdg.de:/pub/linux/hpcdtoppm/) may be +useful for reading photocds. + +To play an audio CD, you should first unmount and remove any data +CDROM. Any of the CDROM player programs should then work (workman, +workbone, cdplayer, etc.). Lacking anything else, you could use the +cdtester program in Documentation/cdrom/sbpcd. + +On a few drives, you can read digital audio directly using a program +such as cdda2wav. The only types of drive which I've heard support +this are Sony and Toshiba drives. You will get errors if you try to +use this function on a drive which does not support it. + +For supported changers, you can use the `cdchange' program (appended to +the end of this file) to switch between changer slots. Note that the +drive should be unmounted before attempting this. The program takes +two arguments: the CDROM device, and the slot number to which you wish +to change. If the slot number is -1, the drive is unloaded. + + +4. Compilation options +---------------------- + +There are a few additional options which can be set when compiling the +driver. Most people should not need to mess with any of these; they +are listed here simply for completeness. A compilation option can be +enabled by adding a line of the form `#define