PDF is well known as a document format that is hard to modify. Pulling images out of a Word file takes little effort, but the same is not true of a PDF file. This post introduces two approaches to extracting images from PDFs and explains how they differ.
Why Extract Images From PDFs?
We have all been in this situation: you receive PDF materials (for work, study, or other purposes) and need to extract all the images from them for further use, such as producing presentations and reports or redesigning something. Many people are looking for an efficient and reliable way to do this.
Because so many people need to process PDFs, many companies build this capability into their applications to give users an efficient tool for work or study. Today, companies no longer have to develop such features from scratch; they can rely on an existing SDK, so choosing the right one matters. The examples below show how simple the ComPDFKit PDF SDK approach is.
The Difference Between Extracting Images and Pages
PDF pages are much easier to extract than images. Images are part of pages: once you have found the page that holds the image you want, you still need to query the internal PDF structure to locate and extract it. Each SDK has its own logic for this. Let’s compare the logic of Apple's Core Graphics & PDFKit with that of ComPDFKit PDF SDK.
Extract Images with Different SDKs
Using Core Graphics & PDFKit
To extract embedded images with Core Graphics, the system-provided API, you first need to understand the PDF structure, which is undeniably complex. Let's take a look at the basic logic in Objective-C.
Open the document with CGPDFDocument and obtain the CGPDFPage on which the image we want to extract is placed. Then we get the page's Resources dictionary entry, which is represented by a CGPDFDictionary object. Traversing the PDF tree further, we retrieve the XObject entry, which might contain multiple elements on a PDF page, including images. Next, we try to get the image stream of the first image on this page (a CGPDFStream object), which the Im0 entry might represent. Finally, we extract the image data from the stream object and convert it into a UIImage object. Here is the code sample.
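- (UIImage *)getEmbeddedImageofPDFAtURL:(NSURL *)url pageIndex:(NSInteger)pageIndex {
    CGPDFDocumentRef document = CGPDFDocumentCreateWithURL((__bridge CFURLRef)url);
    if (document == NULL) {
        NSLog(@"Couldn't open PDF.");
        return nil;
    }
    // Core Graphics numbers pages from 1, so convert the zero-based index.
    CGPDFPageRef page = CGPDFDocumentGetPage(document, pageIndex + 1);
    CGPDFDictionaryRef dictionary = page ? CGPDFPageGetDictionary(page) : NULL;
    if (dictionary == NULL) {
        NSLog(@"Couldn't open page.");
        CGPDFDocumentRelease(document);
        return nil;
    }
    CGPDFDictionaryRef resources = NULL;
    CGPDFDictionaryGetDictionary(dictionary, "Resources", &resources);
    if (resources == NULL) {
        NSLog(@"Couldn't get Resources.");
        CGPDFDocumentRelease(document);
        return nil;
    }
    CGPDFDictionaryRef xObject = NULL;
    CGPDFDictionaryGetDictionary(resources, "XObject", &xObject);
    if (xObject == NULL) {
        NSLog(@"Couldn't load page XObject.");
        CGPDFDocumentRelease(document);
        return nil;
    }
    // Im0 is only a common name for the first image; other PDFs may use different keys.
    CGPDFStreamRef imageStream = NULL;
    CGPDFDictionaryGetStream(xObject, "Im0", &imageStream);
    if (imageStream == NULL) {
        NSLog(@"No image on PDF page.");
        CGPDFDocumentRelease(document);
        return nil;
    }
    // On return, format reports whether the copied data is raw samples or encoded (for example JPEG).
    CGPDFDataFormat format = CGPDFDataFormatRaw;
    CFDataRef data = CGPDFStreamCopyData(imageStream, &format);
    if (data == NULL) {
        NSLog(@"Couldn't convert image stream to data.");
        CGPDFDocumentRelease(document);
        return nil;
    }
    // imageWithData: succeeds only if the data is in an image file format that UIImage
    // understands (for example a JPEG-encoded stream); raw pixel samples yield nil.
    UIImage *image = [UIImage imageWithData:(__bridge NSData *)data];
    // We own both the copied data and the document, so release them before returning.
    CFRelease(data);
    CGPDFDocumentRelease(document);
    return image;
}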
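The sample above hands the stream data straight to UIImage, which only works when the stream holds encoded image data. As a rough sketch of how the format reported by CGPDFStreamCopyData could be taken into account, the helper below branches on it. The function name ImageFromPDFStream is hypothetical, and the raw branch assumes 8-bit DeviceRGB samples without alpha; real PDFs also use other color spaces, bit depths, and masks that this sketch does not handle.

```objective-c
#import <UIKit/UIKit.h>
#import <CoreGraphics/CoreGraphics.h>

// Hypothetical helper (not part of any SDK): turns one image XObject stream
// into a UIImage, branching on the format reported by CGPDFStreamCopyData.
static UIImage *ImageFromPDFStream(CGPDFStreamRef imageStream) {
    if (imageStream == NULL) {
        return nil;
    }
    CGPDFDataFormat format = CGPDFDataFormatRaw;
    CFDataRef data = CGPDFStreamCopyData(imageStream, &format);
    if (data == NULL) {
        return nil;
    }
    UIImage *image = nil;
    if (format != CGPDFDataFormatRaw) {
        // JPEG or JPEG 2000 data: let UIImage try to decode it directly.
        image = [UIImage imageWithData:(__bridge NSData *)data];
    } else {
        // Raw samples: read the image geometry from the stream dictionary.
        CGPDFDictionaryRef dict = CGPDFStreamGetDictionary(imageStream);
        CGPDFInteger width = 0, height = 0, bits = 0;
        CGPDFDictionaryGetInteger(dict, "Width", &width);
        CGPDFDictionaryGetInteger(dict, "Height", &height);
        CGPDFDictionaryGetInteger(dict, "BitsPerComponent", &bits);
        // Assumption: 8-bit DeviceRGB without alpha, i.e. 3 bytes per pixel.
        size_t bytesPerRow = (size_t)width * 3;
        if (width > 0 && height > 0 && bits == 8 &&
            (size_t)CFDataGetLength(data) >= bytesPerRow * (size_t)height) {
            CGColorSpaceRef rgb = CGColorSpaceCreateDeviceRGB();
            CGDataProviderRef provider = CGDataProviderCreateWithCFData(data);
            CGImageRef cgImage = CGImageCreate((size_t)width, (size_t)height, 8, 24,
                                               bytesPerRow, rgb,
                                               (CGBitmapInfo)kCGImageAlphaNone,
                                               provider, NULL, false,
                                               kCGRenderingIntentDefault);
            if (cgImage != NULL) {
                image = [UIImage imageWithCGImage:cgImage];
                CGImageRelease(cgImage);
            }
            CGDataProviderRelease(provider);
            CGColorSpaceRelease(rgb);
        }
    }
    // The image (or its data provider) keeps its own reference to the data.
    CFRelease(data);
    return image;
}
```

Anything the sketch cannot interpret comes back as nil, so a caller can skip that stream instead of failing the whole page.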
To detect every image on the page with Core Graphics, rather than only the Im0 entry, you can iterate over the XObject dictionary obtained above with this logic.
// Collect the keys of all XObject entries whose Subtype is Image.
NSMutableArray<NSString *> *imageKeys = [NSMutableArray array];
CGPDFDictionaryApplyBlock(xObject, ^bool(const char * _Nonnull key, CGPDFObjectRef _Nonnull value, void * _Nullable info) {
    CGPDFStreamRef stream = NULL;
    // Skip entries that are not streams.
    if (!CGPDFObjectGetValue(value, kCGPDFObjectTypeStream, &stream) || stream == NULL) {
        return true;
    }
    CGPDFDictionaryRef streamDictionary = CGPDFStreamGetDictionary(stream);
    if (!streamDictionary) {
        return true;
    }
    const char *subtype = NULL;
    CGPDFDictionaryGetName(streamDictionary, "Subtype", &subtype);
    if (subtype && !strcmp(subtype, "Image")) {
        [imageKeys addObject:[NSString stringWithUTF8String:key]];
    }
    return true; // Returning true continues with the next entry.
}, NULL);
// Convert every image stream found above; skip the ones that cannot be converted.
NSMutableArray<UIImage *> *images = [NSMutableArray array];
for (NSString *imageKey in imageKeys) {
    CGPDFStreamRef imageStream = NULL;
    CGPDFDictionaryGetStream(xObject, [imageKey UTF8String], &imageStream);
    if (imageStream == NULL) {
        NSLog(@"Couldn't get image stream.");
        continue;
    }
    CGPDFDataFormat format = CGPDFDataFormatRaw;
    CFDataRef data = CGPDFStreamCopyData(imageStream, &format);
    if (data == NULL) {
        NSLog(@"Couldn't convert image stream to data.");
        continue;
    }
    UIImage *image = [UIImage imageWithData:(__bridge NSData *)data];
    CFRelease(data);
    if (!image) {
        NSLog(@"Couldn't convert image data to image.");
        continue;
    }
    [images addObject:image];
}
// The enclosing method is assumed to return an array of images here.
return images;
PDFKit itself has no API for extracting images. You can, however, run the logic from the section above on the CGPDFPage instance exposed by the pageRef property of PDFPage, which in turn can be obtained via pageAtIndex: from the PDFDocument.
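As a minimal sketch of that bridge from PDFKit to Core Graphics, the helper below walks from a PDFDocument to the XObject dictionary of one page. The function name XObjectDictionaryForPage is hypothetical; pageAtIndex:, pageCount, and pageRef are the PDFKit members used. It assumes the page dictionary carries its own Resources entry, although the PDF format also allows a page to inherit Resources from a parent node, a case this sketch reports as NULL.

```objective-c
#import <PDFKit/PDFKit.h>

// Hypothetical helper: returns the XObject dictionary of one PDFKit page,
// or NULL if the page or any dictionary along the way is missing.
// The result is owned by the page, so keep `document` alive while using it.
static CGPDFDictionaryRef XObjectDictionaryForPage(PDFDocument *document, NSUInteger pageIndex) {
    if (document == nil || pageIndex >= document.pageCount) {
        return NULL;
    }
    // PDFKit page indexes are zero-based.
    PDFPage *page = [document pageAtIndex:pageIndex];
    CGPDFPageRef pageRef = page.pageRef;
    if (pageRef == NULL) {
        return NULL;
    }
    CGPDFDictionaryRef pageDictionary = CGPDFPageGetDictionary(pageRef);
    if (pageDictionary == NULL) {
        return NULL;
    }
    // Assumption: Resources is stored on the page itself, not inherited.
    CGPDFDictionaryRef resources = NULL;
    if (!CGPDFDictionaryGetDictionary(pageDictionary, "Resources", &resources)) {
        return NULL;
    }
    CGPDFDictionaryRef xObject = NULL;
    if (!CGPDFDictionaryGetDictionary(resources, "XObject", &xObject)) {
        return NULL;
    }
    return xObject;
}
```

The returned dictionary can then be searched for image streams with the same Core Graphics calls as before.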
Using ComPDFKit
ComPDFKit PDF SDK for iOS supports extracting embedded images from PDFs. You can grab the images in a PDF directly, without any knowledge of the PDF structure. Keep reading to learn this easy way.
To extract images from a PDF document, use the function CPDFDocument::extractImageFromPages:toPath:. Extracting images from a page is time-consuming, so we recommend performing this operation asynchronously. In addition, you can use CPDFDocument::cancelExtractImageFromPages: to cancel the operation. The code below grabs all images from the first page of the given PDF document.
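// Replace the empty strings with the path of the PDF and the output path.
NSURL *url = [NSURL fileURLWithPath:@""];
// This sample uses manual reference counting; under ARC, drop the autorelease call.
CPDFDocument *document = [[[CPDFDocument alloc] initWithURL:url] autorelease];
// Index 0 selects the first page.
NSIndexSet *pages = [NSIndexSet indexSetWithIndex:0];
NSString *imagePath = @"";
// Run the extraction on a background queue so the main thread is not blocked.
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
    [document extractImageFromPages:pages toPath:imagePath];
});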
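Once the extraction has finished, you will probably want to pick up the results. The sketch below assumes that the SDK writes the extracted images as files into a directory at imagePath, which the method name suggests but this post does not spell out. The helper name ImageFilesInDirectory and the list of file extensions are also assumptions, and only Foundation calls are used.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: lists the files in `directory` whose extension looks
// like an image. The extension list is a guess and may need adjusting.
static NSArray<NSString *> *ImageFilesInDirectory(NSString *directory) {
    NSError *error = nil;
    NSArray<NSString *> *names =
        [[NSFileManager defaultManager] contentsOfDirectoryAtPath:directory error:&error];
    if (names == nil) {
        NSLog(@"Couldn't read %@: %@", directory, error);
        return @[];
    }
    NSSet<NSString *> *extensions = [NSSet setWithObjects:@"png", @"jpg", @"jpeg", nil];
    NSMutableArray<NSString *> *paths = [NSMutableArray array];
    for (NSString *name in names) {
        // Compare extensions case-insensitively.
        if ([extensions containsObject:name.pathExtension.lowercaseString]) {
            [paths addObject:[directory stringByAppendingPathComponent:name]];
        }
    }
    // Sort so that the order is stable between runs.
    return [paths sortedArrayUsingSelector:@selector(compare:)];
}
```

Calling ImageFilesInDirectory(imagePath) after the background block completes would return the full paths of the extracted files, ready to be loaded with UIImage.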
Conclusion
This post showed two methods for extracting embedded images from PDFs. The CGPDF APIs are available by default on iOS, but the logic they require is complicated. ComPDFKit PDF SDK offers a simple and fast alternative: you can obtain all the images from a PDF with a few lines of code. For more details about extracting images, or for tech support, please contact our support team.