// CopyBitsQuickly.c by Denis G. Pelli Copyright ©1989-1995


/*
CopyBitsQuickly.c

CopyBitsQuickly.c is a dumb substitute for CopyBits that ignores the color
tables and palettes, simply copying the raw pixels without any translation. It's
for doing animations. (Try the demo Sandstorm.) Besides copying images, it can
also add or multiply them. At one time it copied much faster than CopyBits did,
but the latest timing (under System 7), by TimeVideo, indicates that they are of
approximately equal speed.

Apple's CopyBits is an Apple Macintosh Toolbox routine for copying images, and
is documented in Inside Macintosh Volumes I, V, and VI, and New Inside Macintosh:
"Imaging with QuickDraw". CopyBitsQuickly does not cause the Memory Manager to
move memory, and thus may be used in a VBL task.

I suggest that you use the higher-level interface provided by
the VideoToolbox CopyWindows.c, which saves you from getting your hands dirty
messing with pixmaps. You can just deal with windows and GWorlds.

The returned value is nonzero if an error occurred.

CopyBitsQuickly supports four modes:
• srcCopy copies the source to the destination.
• addOver adds the source to the destination. Both must have 8-bit pixels.
Overflow is ignored.
• addOverParallel adds the source to the destination (4 bytes at a time), i.e.
parallel addition. Overflow may carry over into neighboring pixels within the image.
Supports all pixel sizes.
• mulOver causes the source and destination to be multiplied, pixel by pixel. Both
must have 8-bit pixels. After multiplication, the product is divided by 128 and
stored in the destination. Overflow is ignored. All the arithmetic is unsigned.
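
The difference between addOver and addOverParallel, and the scaling used by
mulOver, can be seen in a small portable sketch. The helper names AddBytes,
AddWordParallel, and MulByte are hypothetical and are not part of
CopyBitsQuickly. The sketch assumes a C99 compiler (for stdint.h) and models the
4-byte word as a plain uint32_t value rather than reading it from memory, so
byte order plays no role here.

```c
#include <stdint.h>

// addOver on one pair of 8-bit pixels: the sum wraps modulo 256.
static uint8_t AddBytes(uint8_t dst, uint8_t src)
{
	return (uint8_t)(dst + src);
}

// addOverParallel on four packed 8-bit pixels: a single 32-bit add, so a
// carry out of one pixel spills into the next more significant pixel.
static uint32_t AddWordParallel(uint32_t dst, uint32_t src)
{
	return dst + src;
}

// mulOver on one pair of 8-bit pixels: unsigned product divided by 128,
// then truncated to 8 bits (overflow ignored).
static uint8_t MulByte(uint8_t dst, uint8_t src)
{
	return (uint8_t)(((unsigned)dst * src) >> 7);
}
```

Adding pixels 200 and 100 with AddBytes gives 44. AddWordParallel(0xC8,0x64)
gives 0x12C: the low pixel is again 44, but the neighboring pixel has been
incremented by the carry. In MulByte a pixel value of 128 acts as unity.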

RESTRICTIONS:
• srcBits and dstBits must both have the same number of bits/pixel.
• dstRect and srcRect must have the same size.
• mode must be either srcCopy, addOver, addOverParallel, or mulOver.
• maskRgn must be NULL.
• If mode is addOver or mulOver then the pixel size must be 8 bits.
• If CopyBitsQuickly detects a violation of any of these restrictions it will return
a nonzero value, indicating that an error occurred.

RETURNED VALUE:
0 Success.
1 Illegal srcMode (only srcCopy, addOver, addOverParallel, and mulOver are allowed).
2 maskRgn!=NULL.
3 Source and destination rects are of unequal size.
4 After clipping there were no pixels to copy, or RectToAddress couldn't resolve 
	address of source or destination.
5 Source and destination have unequal pixel sizes.
6 We need 32-bit addressing but it's not available.
7 This mode requires 8-bit pixels and the supplied pixel size is not 8 bits.
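
A caller may want to turn the returned value into a message. The following is a
minimal sketch: CopyBitsQuicklyErrorString is a hypothetical helper, not part of
the VideoToolbox, and its strings simply paraphrase the table above.

```c
#include <stddef.h>

// Hypothetical helper: maps the value returned by CopyBitsQuickly to text.
static const char *CopyBitsQuicklyErrorString(int error)
{
	static const char *const text[] = {
		"success",
		"illegal srcMode",
		"maskRgn is not NULL",
		"source and destination rects are of unequal size",
		"nothing to copy after clipping, or address could not be resolved",
		"source and destination have unequal pixel sizes",
		"32-bit addressing is needed but not available",
		"this mode requires 8-bit pixels"
	};

	// Anything outside the documented range 0..7 is reported as unknown.
	if (error < 0 || (size_t)error >= sizeof(text) / sizeof(text[0]))
		return "unknown error";
	return text[error];
}
```

Typical use, assuming the caller already has valid pixmaps and rects:
	error=CopyBitsQuickly(srcBits,dstBits,&srcRect,&dstRect,srcCopy,NULL);
	if(error)printf("CopyBitsQuickly: %s\n",CopyBitsQuicklyErrorString(error));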

ACKNOWLEDGEMENTS:
I learned the trick of using a switch() to jump into a loop from Bill Karsh's solution
to the April 1994 MacTech Programmer's Challenge.
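
The trick can be sketched in portable C as follows. This is a reduced
illustration, unrolled 4 times rather than 32 and copying bytes rather than
longs. CopyUnrolled is a hypothetical name, not a routine in this file.

```c
#include <stddef.h>

// Copy n bytes with a loop unrolled 4 times. The switch jumps into the
// middle of the loop body to do the n%4 leftover copies first; each later
// pass through the loop does 4 copies.
static void CopyUnrolled(unsigned char *dst, const unsigned char *src, size_t n)
{
	long i = (long)(n >> 2);	// number of full passes still to do

	switch (n & 3) {
		for (; i >= 0; i--) {
					*dst++ = *src++;
			case 3:	*dst++ = *src++;
			case 2:	*dst++ = *src++;
			case 1:	*dst++ = *src++;
			case 0:;
		}
	}
}
```

The first, partial pass is entered through a case label, so the loop's
decrement and test run once before any full pass. That is why the loop
condition is i>=0 rather than i>0.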

LIMITATIONS:
• If a Rect extends across multiple screens, only the upper-left part of the
Rect that lies on one device will be used. The rest is clipped off.
• When accessing a screen, CopyBitsQuickly() ought to, but doesn't, call
ShieldCursor() to remove the cursor from the part of the screen it's reading or
writing. Calling ShieldCursor would also have the desirable side effect of
informing nonstandard video devices, like the Radius PowerView, that the screen
has been updated. (NOTE: CopyWindows does this for you before calling CopyBitsQuickly.)

NOTE: For highest speed you should choose your srcRectPtr & dstRectPtr so that
the first point moved to and from each row begins at a memory address that is a
multiple of 4 bytes. The effect on speed is substantial, about 25%.
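
One way to check this in advance is to compute the address of the first pixel of
a row and test its low two bits. The sketch below assumes packed pixels whose
size is 1, 2, 4, 8, 16, or 32 bits; FirstByteOffset and IsLongAligned are
hypothetical helpers, and "left" is measured in pixels from the left edge of the
pixmap's bounds. It assumes a C99 compiler (for stdint.h).

```c
#include <stdint.h>

// Byte offset, within a row, of the pixel at horizontal position left
// (in pixels, relative to the pixmap's left bound).
static long FirstByteOffset(long left, int pixelSize)
{
	return left * pixelSize / 8;
}

// Nonzero if the first byte moved in each row lies on a 4-byte boundary.
// Every row is aligned only if rowBytes is also a multiple of 4.
static int IsLongAligned(uintptr_t baseAddr, long rowBytes, long left, int pixelSize)
{
	uintptr_t first = baseAddr + (uintptr_t)FirstByteOffset(left, pixelSize);

	return (first & 3) == 0 && (rowBytes & 3) == 0;
}
```

With 8-bit pixels, for example, a left edge at a multiple of 4 pixels is aligned
whenever the base address and rowBytes are themselves multiples of 4. The check
should be made for both the source and the destination.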

NOTE: If your computer boots in 24-bit mode, as set by the Memory Control Panel,
then the THINK C Debugger will crash if it's activated while you've temporarily
switched into 32-bit mode. So don't put any breakpoints in any section of code
that's bracketed by calls to SwapMMUMode() unless your computer booted up in
32-bit mode. If your computer boots in 32-bit mode then the calls to SwapMMUMode
do nothing, and you can put Debugger breakpoints anywhere.

BLOCKMOVEDATA: NOT FASTER

BlockMoveData is a new (as of System Update 3.0 to System 7) variant of
BlockMove that omits cache flushing. Issuing BlockMoveData() on earlier versions
of the operating system will invoke plain old BlockMove(). BlockMoveData, like
BlockMove, uses the MOVE16 instruction, on computers that have it, so it could
potentially be faster than the generic code that most compilers produce.
However, I haven't found any speed advantage on the PowerBook 170, Mac II, IIfx,
and Power Mac in my lab (none of which have the 68040, which is the only cpu
that has MOVE16). I haven't tried it on a Quadra. 

To my surprise, BlockMoveData isn't faster than my C loop on the Mac II, IIfx,
or Power Mac 6100/60, and is distinctly slower in a few cases (e.g. 1 bit mode
on Toby card), so I've disabled it. So the following 2 paragraphs are moot.
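
For reference, the row-by-row use of a block-move routine amounts to the
following. This sketch substitutes the Standard C memmove for BlockMoveData so
that it is portable. CopyRows is a hypothetical name, and the sketch says
nothing about relative speed.

```c
#include <string.h>

// Copy lines rows of bytes bytes each. srcRowBytes and dstRowBytes are the
// distances in bytes from the start of one row to the start of the next.
static void CopyRows(const unsigned char *src, size_t srcRowBytes,
	unsigned char *dst, size_t dstRowBytes, size_t bytes, size_t lines)
{
	// If neither pixmap has padding between rows, the whole block is
	// contiguous and can be moved in one call.
	if (lines > 1 && srcRowBytes == bytes && dstRowBytes == bytes) {
		bytes *= lines;
		lines = 1;
	}
	for (; lines > 0; lines--) {
		memmove(dst, src, bytes);
		src += srcRowBytes;
		dst += dstRowBytes;
	}
}
```

Merging contiguous rows into one long row is the same shortcut that
CopyBitsQuickly takes when srcinc and dstinc are both zero.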

IGNORE: CopyBitsQuickly.c, if possible, now uses Apple's BlockMoveData() for
highest possible speed on all Macs. For best performance you should set your
Memory Control Panel to 32-bit addressing, and you should install Apple's System
Update 3.0, which requires System 7.1, or whatever System release supersedes it.
If your Mac is very old, e.g. a Mac II, you may need to install the freely
available MODE32 init in order to be able to enable 32-bit addressing.

IGNORE: I've disabled the use of BlockMoveData if the computer has a 68040
processor, yet is a Mac II. I do this because BlockMoveData crashes on my Radius
Rocket (68040 processor on a NuBus card) in my Mac II when handed addresses in
video memory, even though MODE32 is installed. Presumably this indicates that
the BlockMove routine is not 32-bit clean, despite the runtime patches installed
by MODE32. This is puzzling since BlockMove works fine accessing the same video
addresses in either 24 or 32-bit mode (with MODE32) without the Rocket.

Copyright ©1989-1995 Denis G. Pelli.

HISTORY:
1/89 dgp	Version 2.0: added support for PixMaps and multiple screens. Added checking.
6/89 dgp	Version 3.0: now use RectToAddress, which clips to one device.
10/89 dgp	Version 3.5: Improved resolution from longs to bytes.
10/89 dgp	Version 4.0: Added new mode: addOver
3/90  dgp	Version 4.01: Made cosmetic changes:
			renamed srcRect & dstRect to srcRectPtr and dstRectPtr.
			renamed srcAdd to addOver, to conform to CopyBits.
			added a few more comments to explain the initial clipping.
3/20/90	dgp	made compatible with MPW C.
4/20/90	dgp	now uses 32-bit addressing only if QD32 is present.
4/9/91	dgp	v 4.05: changed nudge from short to long, just to be safe
8/24/91	dgp	Made compatible with THINK C 5.0.
4/15/92	dgp	Updated CopyBitsQuickly's function header to Standard C style.
10/5/92 dgp	Dropped support for THINK C 4. Updated the documentation above.
12/2/92 dgp cosmetic changes
12/8/92 dgp fixed major gaffe introduced on 12/2/92: "case" prefix was 
			missing in switch statement. This caused CopyBitsQuickly to do nothing. 
1/31/93	dgp	Added new "multiplyQuickly" mode requested by Josh Solomon. Now 
			insist on 8-bit pixels for both addOver and multiplyQuickly modes.
2/18/93	js	added mulOver to list of allowed modes. (Oops! - dgp.) Works ok now.
2/18/93	dgp	Now return int, nonzero if error occurred.
7/9/93	dgp check for 32-bit addressing capability.
6/5/94	dgp Replaced all assembly code by portable C code of similar speed. Only call 
			SwapMMUMode() if we must. Give error if we need 32-bit mode and it's not 
			available. Documented the returned value.
6/7/94	dgp	Added code to use Apple's BlockMoveData() for highest
			possible speed on all Macs, but disabled it because it didn't turn
			out to be faster on the machines on which I've tested it: Mac II, IIfx,
			and Power Mac 6100/60.
6/7/94	dgp	Added new mode "addOverParallel" which accepts any pixelSize and adds
			source to destination very quickly by adding 4 bytes at a time.
6/14/94	dgp	can32 is now computed by calling TrapAvailable(_SwapMMUMode), which 
			returns the correct answer even on Macs with dirty ROMs.
*/
#include "VideoToolbox.h"
// The srcMode constants addOverParallel and mulOver are defined in VideoToolbox.h
#include 
#include <Traps.h>	// _SwapMMUMode
#if (THINK_C || THINK_CPLUS)
	// These THINK C options seem to have very little effect on the code produced.
	// However, if you don't disable "assign_registers" then one of the variables
	// declared "register" in SrcCopyQuickly() fails to be assigned to a register.
	#pragma options(!assign_registers,honor_register,redundant_loads,defer_adjust)
	#pragma options(global_optimizer,gopt_induction,gopt_loop,gopt_cse,gopt_coloring)
#endif

typedef unsigned char *UPtr;

static void SrcCopyQuickly(UPtr Src,unsigned long srcinc,
	UPtr Dst,unsigned long dstinc,
	unsigned long bytes,unsigned long lines,Boolean do32);
static void SrcCopyQuickly2(UPtr Src,unsigned long srcinc,
	UPtr Dst,unsigned long dstinc,
	unsigned long bytes,unsigned long lines,Boolean do32);
static void AddOverParallel(UPtr Src,unsigned long srcinc,
	UPtr Dst,unsigned long dstinc,
	unsigned long bytes,unsigned long lines,Boolean do32);
static void AddOver8(UPtr Src,unsigned long srcinc,
	UPtr Dst,unsigned long dstinc,
	unsigned long bytes,unsigned long lines,Boolean do32);
static void MulOver8(UPtr Src,unsigned long srcinc,
	UPtr Dst,unsigned long dstinc,
	unsigned long bytes,unsigned long lines,Boolean do32);

int CopyBitsQuickly(BitMap *srcBits,BitMap *dstBits
	,Rect *srcRectPtr,Rect *dstRectPtr,long srcMode,RgnHandle maskRgn)
{
	UPtr Src,Dst;
	long srcinc,dstinc;
	unsigned long lines;
	short srcRowBytes,dstRowBytes,srcPixelSize,dstPixelSize,srcBitsOffset,dstBitsOffset;
	Rect mySrcRect,myDstRect;
	int dx,dy;
	long nudge,bytes;
	Boolean do32,useBlockMove;
	static Boolean can32,is32,wantBlockMove,firstTime=1;
	long error,addressing,machine,processor;

	srcMode&=0xffff;	// upper bits are used only by CopyWindows.
	if(srcMode != srcCopy && srcMode != addOver && srcMode != addOverParallel && srcMode != mulOver)
		return 1;
	if(maskRgn != NULL) return 2;

	/* clip the rect to be copied by the bounds of source and destination */
	mySrcRect=*srcRectPtr;
	myDstRect=*dstRectPtr;
	/* first make sure that srcRect and dstRect are the same size */
	if(mySrcRect.bottom-mySrcRect.top != myDstRect.bottom-myDstRect.top || 
		mySrcRect.right-mySrcRect.left != myDstRect.right-myDstRect.left) 
			return 3;
	dx=myDstRect.left-mySrcRect.left;
	dy=myDstRect.top-mySrcRect.top;
	/* clip myDstRect */
	Dst = RectToAddress((PixMap *)dstBits,&myDstRect,&dstRowBytes,&dstPixelSize,&dstBitsOffset);

	/*
	This prevents writing outside the destination.
	The cost is that part of the inside will not be written.
	The problem arises because this routine's code can only write whole bytes,
	and the boundary may be in the middle of a byte. So, rather than writing an
	extra fraction of a byte (outside the destination rect) we leave the byte
	alone and fail to update a small portion inside the destination rect.
	*/
	if(dstBitsOffset>0) {
		nudge=(7+dstBitsOffset)/8;
		dstBitsOffset -= nudge*8;
		Dst += nudge;
		myDstRect.left += nudge*8/dstPixelSize;
	}

	/* Copy any clipping of myDstRect over to mySrcRect */
	mySrcRect=myDstRect;
	OffsetRect(&mySrcRect,-dx,-dy);
	/* clip mySrcRect */
	Src=RectToAddress((PixMap *)srcBits,&mySrcRect
		,&srcRowBytes,&srcPixelSize,&srcBitsOffset);

	/* Copy any clipping of mySrcRect back to myDstRect */
	myDstRect=mySrcRect;
	OffsetRect(&myDstRect,dx,dy);
	Dst=RectToAddress((PixMap *)dstBits,&myDstRect
		,&dstRowBytes,&dstPixelSize,&dstBitsOffset);

	if(Src==NULL || Dst==NULL) return 4;
	if(srcPixelSize != dstPixelSize) return 5;
	bytes = mySrcRect.right - mySrcRect.left;	/* number of pixels per line */
	bytes *= srcPixelSize;						/* number of bits per line */
	bytes /= 8;									/* number of bytes per line */
	srcinc = srcRowBytes - bytes;				/* offset in bytes to beginning of next line */
	dstinc = dstRowBytes - bytes;
	lines=mySrcRect.bottom - mySrcRect.top;		/* number of lines */
	if(srcinc==0 && dstinc==0){
		bytes*=lines;
		lines=1;
	}
	if(firstTime){
		can32=TrapAvailable(_SwapMMUMode);
		addressing=0;
		error=Gestalt(gestaltAddressingModeAttr,&addressing);
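		is32=addressing&(1<<gestalt32BitAddressing);
		// Assumption: wantBlockMove is left at 0 here, since BlockMoveData is
		// disabled (see the discussion at the top of this file).
		firstTime=0;
	}
	// An address above 0xffffff can only be reached in 32-bit addressing mode.
	do32=(unsigned long)Src>0xffffffUL || (unsigned long)Dst>0xffffffUL;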
	do32=do32 && !is32;
	if(do32 && !can32)return 6;
	// Can't use traps if we switch 24/32-bit mode.
	useBlockMove=wantBlockMove && !do32 && srcPixelSize==dstPixelSize;

	switch(srcMode){
	case srcCopy:
		if(useBlockMove)SrcCopyQuickly2(Src,srcinc,Dst,dstinc,bytes,lines,do32);
		else SrcCopyQuickly(Src,srcinc,Dst,dstinc,bytes,lines,do32);
		break;
	case addOverParallel:
		if(srcPixelSize!=dstPixelSize)return 5;
		AddOverParallel(Src,srcinc,Dst,dstinc,bytes,lines,do32);
		break;
	case addOver:
		if(srcPixelSize!=8 || dstPixelSize!=8)return 7;
		AddOver8(Src,srcinc,Dst,dstinc,bytes,lines,do32);
		break;
	case mulOver:
		if(srcPixelSize!=8 || dstPixelSize!=8)return 7;
		MulOver8(Src,srcinc,Dst,dstinc,bytes,lines,do32);
		break;
	default:
		return 1;
		break;
	}
	return 0;
}

static void SrcCopyQuickly2(register UPtr Src,register unsigned long srcinc,
	register UPtr Dst,register unsigned long dstinc,
	unsigned long bytes,register unsigned long lines,Boolean do32)
{
	// See discussion of BlockMoveData at top of this file.

	(void)do32;	/* dgp: prevent "unused argument" warning */
	srcinc+=bytes;
	dstinc+=bytes;
	for(;lines>0;lines--){
		BlockMoveData(Src,Dst,bytes);
		Src+=srcinc;
		Dst+=dstinc;
	}
}	

#define useMask 0
static void SrcCopyQuickly(UPtr xSrc,register unsigned long srcinc,
	UPtr xDst,register unsigned long dstinc,
	register unsigned long bytes,register unsigned long lines,Boolean do32)
{
	register unsigned long *SrcL=(unsigned long *)xSrc,*DstL=(unsigned long *)xDst;
	register long i;
	char mmumode=true32b;
	static unsigned long mask32[4]={0,0xff,0xffff,0xffffff};
	register unsigned long m=mask32[bytes&3];

	if(useMask){
		srcinc+=bytes&3;
		dstinc+=bytes&3;
	}
	if(do32)SwapMMUMode(&mmumode);	/* set 32-bit mode */
	
	for(;lines>0;lines--) { //may want to change this to  while (lines--) see blit.c
	
		i=bytes>>7;
		switch((bytes>>2)&31){
			for(;i>=0;i--){
							*DstL++ = *SrcL++;
				case 31:	*DstL++ = *SrcL++;
				case 30:	*DstL++ = *SrcL++;
				case 29:	*DstL++ = *SrcL++;
				case 28:	*DstL++ = *SrcL++;
				case 27:	*DstL++ = *SrcL++;
				case 26:	*DstL++ = *SrcL++;
				case 25:	*DstL++ = *SrcL++;
				case 24:	*DstL++ = *SrcL++;
				case 23:	*DstL++ = *SrcL++;
				case 22:	*DstL++ = *SrcL++;
				case 21:	*DstL++ = *SrcL++;
				case 20:	*DstL++ = *SrcL++;
				case 19:	*DstL++ = *SrcL++;
				case 18:	*DstL++ = *SrcL++;
				case 17:	*DstL++ = *SrcL++;
				case 16:	*DstL++ = *SrcL++;
				case 15:	*DstL++ = *SrcL++;
				case 14:	*DstL++ = *SrcL++;
				case 13:	*DstL++ = *SrcL++;
				case 12:	*DstL++ = *SrcL++;
				case 11:	*DstL++ = *SrcL++;
				case 10:	*DstL++ = *SrcL++;
				case 9:		*DstL++ = *SrcL++;
				case 8:		*DstL++ = *SrcL++;
				case 7:		*DstL++ = *SrcL++;
				case 6:		*DstL++ = *SrcL++;
				case 5:		*DstL++ = *SrcL++;
				case 4:		*DstL++ = *SrcL++;
				case 3:		*DstL++ = *SrcL++;
				case 2:		*DstL++ = *SrcL++;
				case 1:		*DstL++ = *SrcL++;
				case 0:;
			}
		}
		if(useMask){
			// Use bitwise ~m (not logical !m) to keep the destination bytes outside the mask.
			if(m) *DstL=(m & *SrcL) | (~m & *DstL);
		}else{
			if(bytes&2){
				*(unsigned short *)DstL=*(unsigned short *)SrcL;
				DstL=(unsigned long *)(1+(unsigned short *)DstL);
				SrcL=(unsigned long *)(1+(unsigned short *)SrcL);
			}
			if(bytes&1){
				*(unsigned char *)DstL=*(unsigned char *)SrcL;
				DstL=(unsigned long *)(1+(unsigned char *)DstL);
				SrcL=(unsigned long *)(1+(unsigned char *)SrcL);
			}
		}
		DstL=(unsigned long *)(dstinc+(unsigned char *)DstL);
		SrcL=(unsigned long *)(srcinc+(unsigned char *)SrcL);
	}
	if(do32)SwapMMUMode(&mmumode);	/* restore */
}

static void AddOverParallel(UPtr xSrc,register unsigned long srcinc,
	UPtr xDst,register unsigned long dstinc,
	register unsigned long bytes,register unsigned long lines,Boolean do32)
{
	register unsigned long *SrcL=(unsigned long *)xSrc,*DstL=(unsigned long *)xDst;
	register long i;
	char mmumode;

	mmumode=true32b;
	if(do32)SwapMMUMode(&mmumode);	/* set 32-bit mode */
	for(;lines>0;lines--) {
		i=bytes>>7;
		switch((bytes>>2)&31){
			for(;i>=0;i--){
							*DstL++ += *SrcL++;
				case 31:	*DstL++ += *SrcL++;
				case 30:	*DstL++ += *SrcL++;
				case 29:	*DstL++ += *SrcL++;
				case 28:	*DstL++ += *SrcL++;
				case 27:	*DstL++ += *SrcL++;
				case 26:	*DstL++ += *SrcL++;
				case 25:	*DstL++ += *SrcL++;
				case 24:	*DstL++ += *SrcL++;
				case 23:	*DstL++ += *SrcL++;
				case 22:	*DstL++ += *SrcL++;
				case 21:	*DstL++ += *SrcL++;
				case 20:	*DstL++ += *SrcL++;
				case 19:	*DstL++ += *SrcL++;
				case 18:	*DstL++ += *SrcL++;
				case 17:	*DstL++ += *SrcL++;
				case 16:	*DstL++ += *SrcL++;
				case 15:	*DstL++ += *SrcL++;
				case 14:	*DstL++ += *SrcL++;
				case 13:	*DstL++ += *SrcL++;
				case 12:	*DstL++ += *SrcL++;
				case 11:	*DstL++ += *SrcL++;
				case 10:	*DstL++ += *SrcL++;
				case 9:		*DstL++ += *SrcL++;
				case 8:		*DstL++ += *SrcL++;
				case 7:		*DstL++ += *SrcL++;
				case 6:		*DstL++ += *SrcL++;
				case 5:		*DstL++ += *SrcL++;
				case 4:		*DstL++ += *SrcL++;
				case 3:		*DstL++ += *SrcL++;
				case 2:		*DstL++ += *SrcL++;
				case 1:		*DstL++ += *SrcL++;
				case 0:;
			}
		}
		if(bytes&2){
			*(unsigned short *)DstL += *(unsigned short *)SrcL;
			DstL=(unsigned long *)(1+(unsigned short *)DstL);
			SrcL=(unsigned long *)(1+(unsigned short *)SrcL);
		}
		if(bytes&1){
			*(unsigned char *)DstL += *(unsigned char *)SrcL;
			DstL=(unsigned long *)(1+(unsigned char *)DstL);
			SrcL=(unsigned long *)(1+(unsigned char *)SrcL);
		}
		DstL=(unsigned long *)(dstinc+(unsigned char *)DstL);
		SrcL=(unsigned long *)(srcinc+(unsigned char *)SrcL);
	}
	if(do32)SwapMMUMode(&mmumode);	/* restore */
}

static void AddOver8(register UPtr Src,register unsigned long srcinc,
	register UPtr Dst,register unsigned long dstinc,
	register unsigned long bytes,register unsigned long lines,Boolean do32)
{
	register long i;
	char mmumode;

	mmumode=true32b;
	if(do32)SwapMMUMode(&mmumode);	/* set 32-bit mode */
	for(;lines>0;lines--) {
		i=bytes>>5;
		switch(bytes&31){
			for(;i>=0;i--){
							*Dst++ += *Src++;
				case 31:	*Dst++ += *Src++;
				case 30:	*Dst++ += *Src++;
				case 29:	*Dst++ += *Src++;
				case 28:	*Dst++ += *Src++;
				case 27:	*Dst++ += *Src++;
				case 26:	*Dst++ += *Src++;
				case 25:	*Dst++ += *Src++;
				case 24:	*Dst++ += *Src++;
				case 23:	*Dst++ += *Src++;
				case 22:	*Dst++ += *Src++;
				case 21:	*Dst++ += *Src++;
				case 20:	*Dst++ += *Src++;
				case 19:	*Dst++ += *Src++;
				case 18:	*Dst++ += *Src++;
				case 17:	*Dst++ += *Src++;
				case 16:	*Dst++ += *Src++;
				case 15:	*Dst++ += *Src++;
				case 14:	*Dst++ += *Src++;
				case 13:	*Dst++ += *Src++;
				case 12:	*Dst++ += *Src++;
				case 11:	*Dst++ += *Src++;
				case 10:	*Dst++ += *Src++;
				case 9:		*Dst++ += *Src++;
				case 8:		*Dst++ += *Src++;
				case 7:		*Dst++ += *Src++;
				case 6:		*Dst++ += *Src++;
				case 5:		*Dst++ += *Src++;
				case 4:		*Dst++ += *Src++;
				case 3:		*Dst++ += *Src++;
				case 2:		*Dst++ += *Src++;
				case 1:		*Dst++ += *Src++;
				case 0:;
			}
		}
		Src += srcinc;
		Dst += dstinc;
	}
	if(do32)SwapMMUMode(&mmumode);	/* restore */
}

// Multiply two unsigned 8-bit pixels, and divide the product by 128.
static void MulOver8(register UPtr Src,register unsigned long srcinc,
	register UPtr Dst,register unsigned long dstinc,
	register unsigned long bytes,register unsigned long lines,Boolean do32)
{
	register long i;
	char mmumode;

	mmumode=true32b;
	if(do32)SwapMMUMode(&mmumode);	/* set 32-bit mode */
	for(;lines>0;lines--) {
		i=bytes>>4;
		switch(bytes&15){
			for(;i>=0;i--){
							*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 15:	*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 14:	*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 13:	*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 12:	*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 11:	*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 10:	*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 9:		*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 8:		*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 7:		*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 6:		*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 5:		*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 4:		*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 3:		*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 2:		*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 1:		*Dst = ((unsigned short)(*Dst)*(*Src++))>>7; Dst++;
				case 0:;
			}
		}
		Src += srcinc;
		Dst += dstinc;
	}
	if(do32)SwapMMUMode(&mmumode);	/* restore */
}