Examining a page
================

Pages are dictionaries
----------------------

In PDFs, the main data structure is the **dictionary**, a key-value data
structure much like a Python ``dict`` or ``attrdict``. The major difference is
that the keys can only be **names**, while values can be any type, including
other dictionaries.

PDF dictionaries are represented as :class:`pikepdf.Dictionary`, and names
are of type :class:`pikepdf.Name`. A page is just another dictionary, with a
few required fields that give it special status as a page.

A :class:`pikepdf.Name` is usually an ASCII-encoded string beginning with
"/" followed by a capital letter.
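
As a minimal sketch of these two types, a dictionary can be built directly in
Python. The keys and values below are illustrative and are not taken from the
example file.

.. code-block:: python

    from pikepdf import Dictionary, Name

    # A name can be written as Name('/Page') or, when it is a valid Python
    # identifier, as Name.Page.
    page_type = Name('/Page')

    # Keyword arguments become keys; the leading "/" is added automatically.
    info = Dictionary(Type=Name.Page, Rotate=90)

    # Values can be other dictionaries.
    info.Resources = Dictionary(XObject=Dictionary())

    # keys() reports every key with its leading "/".
    sorted_keys = sorted(info.keys())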

.. ipython::

    In [1]: from pikepdf import Pdf

    In [1]: example = Pdf.open('../tests/resources/congress.pdf')

    In [1]: page1 = example.pages[0]

    In [1]: page1

Item and attribute notation
---------------------------

Dictionary values may be looked up using item notation
(``page1['/MediaBox']``) or attribute notation (``page1.MediaBox``). The two
conventions are equivalent.

.. ipython::

    In [1]: page1.MediaBox

    In [1]: page1['/MediaBox']

By convention, pikepdf uses attribute notation for keys in the PDF
specification and item notation for internal names within a PDF. For example:

.. ipython::
    :verbatim:

    In [1]: page1.Resources.XObject['/Im0']

Here ``'/Im0'`` is an arbitrary name generated by the program that produced this
PDF, rather than a name in the specification like ``Resources`` and ``XObject``.
Item notation here would be quite cumbersome:
``['/Resources']['/XObject']['/Im0']`` (not recommended).

Attribute notation is convenient, but not robust if elements are missing. For
elements that are not always present, you can use ``.get()``, which behaves like
``dict.get()`` in core Python.  A library such as `glom
<https://github.com/mahmoud/glom>`_ might help when working with complex
structured data that is not always present.
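
The difference between the lookup styles can be sketched as follows.
``page_like`` is a hypothetical stand-in for a page dictionary that lacks
``/Rotate`` and ``/UserUnit``; it is built by hand so the example does not
depend on a file.

.. code-block:: python

    from pikepdf import Array, Dictionary, Name

    # Hypothetical stand-in for a page dictionary with no /Rotate or /UserUnit.
    page_like = Dictionary(Type=Name.Page, MediaBox=Array([0, 0, 200, 304]))

    # Attribute and item notation return equal values.
    same = page_like.MediaBox == page_like['/MediaBox']

    # .get() returns the supplied default (or None) instead of raising.
    user_unit = page_like.get('/UserUnit', 1)
    rotate = page_like.get('/Rotate')

    # Attribute notation raises AttributeError when the key is missing.
    try:
        page_like.Rotate
        missing_raises = False
    except AttributeError:
        missing_raises = True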

repr() output
-------------

Returning to the page's output:

.. ipython::

    In [1]: page1

The angle brackets in the output indicate that this object cannot be
constructed with a Python expression because it contains a reference. When
angle brackets are omitted from the ``repr()`` of a pikepdf object, the
object can be replicated with a Python expression, such that
``eval(repr(x)) == x``.
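
A small sketch of that round trip, using a direct array rather than the page
itself (the values are chosen for illustration):

.. code-block:: python

    import pikepdf

    box = pikepdf.Array([0, 0, 200, 304])

    # The repr of a direct object is a Python expression with no angle brackets.
    text = repr(box)

    # eval() needs the pikepdf module in scope because the repr refers to it.
    rebuilt = eval(text, {'pikepdf': pikepdf})
    round_trips = rebuilt == box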

In Jupyter and IPython, pikepdf will instead attempt to display a preview of
the PDF page. An explicit ``repr(page)`` will show the text representation.

This page's MediaBox is a direct object. The MediaBox describes
the size of the page in PDF coordinates (1/72 inch multiplied by the value of
the page's ``/UserUnit``, if present).

.. ipython::

    In [1]: import pikepdf

    In [1]: page1.MediaBox

    In [1]: pikepdf.Array([ 0, 0, 200, 304 ])
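
To make the units concrete, the following sketch converts a MediaBox to
inches. It assumes a freshly created blank page of the same size as the
example rather than the example file, and treats a missing ``/UserUnit`` as 1.

.. code-block:: python

    import pikepdf

    # Assumption: a blank in-memory page stands in for the example page.
    pdf = pikepdf.Pdf.new()
    page = pdf.add_blank_page(page_size=(200, 304))

    # The MediaBox is [x0, y0, x1, y1] in PDF units.
    x0, y0, x1, y1 = (float(v) for v in page.MediaBox)

    # page.obj is the underlying dictionary; /UserUnit is optional.
    user_unit = float(page.obj.get('/UserUnit', 1))

    # One PDF unit is 1/72 inch multiplied by /UserUnit.
    width_in = (x1 - x0) * user_unit / 72
    height_in = (y1 - y0) * user_unit / 72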

The page's ``/Contents`` key contains instructions for drawing the page content.
Also attached to this page is a ``/Resources`` dictionary, which contains a
single XObject image. The image is compressed with the ``/DCTDecode`` filter,
meaning it is encoded with the :abbr:`DCT (discrete cosine transform)`, so it is
a JPEG. [#]_

.. [#] Without the JFIF header.


Viewing images
--------------

pikepdf provides a helper class :class:`~pikepdf.PdfImage` for manipulating
PDF images.

.. ipython::

    In [1]: from pikepdf import PdfImage

    In [1]: pdfimage = PdfImage(page1.Resources.XObject['/Im0'])

    In [1]: pdfimage
    Out[1]:

In Jupyter (or IPython with a suitable configuration) the image will be
displayed.

|im0|

.. |im0| image:: /images/congress_im0.jpg
  :width: 2in

You can also inspect the properties of the image. The parameters are similar
to Pillow's.

.. ipython::

    In [1]: pdfimage.colorspace

    In [1]: pdfimage.width, pdfimage.height

.. note::

    ``.width`` and ``.height`` are the resolution of the image in pixels, not
    the size of the image in page coordinates.

.. _extract_image:

Extracting images
-----------------

Extracting images is straightforward. :meth:`~pikepdf.PdfImage.extract_to` will
extract images to streams, such as an open file. Where possible, ``extract_to``
writes compressed data directly to the stream without transcoding. The return
value is the file extension that matches the extracted data, such as ``.jpg``.

.. ipython::
    :verbatim:

    In [1]: pdfimage.extract_to(stream=open('file.jpg', 'wb'))

You can also retrieve the image as a Pillow image:

.. ipython::
    :verbatim:

    In [1]: pdfimage.as_pil_image()
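
The same calls can be tried without a file on disk. This sketch builds a
hypothetical 2x2 grayscale image in memory (not the image from the example
PDF) and extracts it to a ``BytesIO`` buffer. Because this image is stored
uncompressed, pikepdf has to transcode it, so the returned extension will
differ from the ``.jpg`` that a ``/DCTDecode`` image would give.

.. code-block:: python

    import io

    import pikepdf
    from pikepdf import Name, PdfImage, Stream

    pdf = pikepdf.Pdf.new()

    # Hypothetical 2x2, 8-bit grayscale image stored without compression.
    pixels = bytes([0, 255, 255, 0])
    image = Stream(
        pdf,
        pixels,
        Type=Name.XObject,
        Subtype=Name.Image,
        Width=2,
        Height=2,
        ColorSpace=Name.DeviceGray,
        BitsPerComponent=8,
    )
    pdfimage = PdfImage(image)

    # Any writable binary stream works; a real file must be opened with 'wb'.
    buffer = io.BytesIO()
    extension = pdfimage.extract_to(stream=buffer)

    # The Pillow image has the same pixel dimensions as the PDF image.
    pil_image = pdfimage.as_pil_image()

When writing to a real file, a ``with open(..., 'wb')`` block ensures the file
is closed once extraction finishes.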

.. note::

    This simple example PDF displays a single full page image. Some PDF creators
    will paint a page using multiple images, and features such as layers,
    transparency and image masks. Accessing the first image on a page is like an
    HTML parser that scans for the first ``<img src="">`` tag it finds. A lot
    more could be happening. There can be multiple images drawn multiple times
    on a page, vector art, overdrawing, masking, and transparency. A set of
    resources can be grouped together in a "Form XObject" (not to be confused
    with a PDF Form), and drawn all at once. Images can be referenced by
    multiple pages.

.. _replace_image:

Replacing an image
------------------

See ``test_image_access.py::test_image_replace``.
