Thanks to digitization, today more than 55 million objects from the collections of Europe's libraries, archives and museums are available online to explore and reuse via Europe's digital library, Europeana. This talk will introduce some of the main activities, datasets and tools in the digitization of cultural heritage. The main focus will be on the Europeana Newspapers collection, a public domain licensed dataset of roughly 12 million pages of historic newspapers, and the possibilities and challenges in analyzing such large and heterogeneous textual resources with common NLP and machine learning tools.