Google Books

[Screenshot: Frankenstein by Mary Shelley, as rendered by Google Books]

- Type of site: Digital library
- Owner: Google
- URL: books.google.com
- Launched: October 2004 (as Google Print)
- Current status: Active
Google Books, a service formerly known by less imaginative names like Google Book Search, Google Print, and the codename Project Ocean, is essentially Google’s attempt to swallow the world’s libraries whole. It’s a vast, digital maw that devours the text of books and magazines Google has managed to scan, then uses its optical character recognition (OCR) magic to convert those images into searchable text. It’s a grotesque, albeit useful, digital archive.
The fodder for this digital beast comes from two main sources: publishers and authors who willingly (or perhaps, under duress) participate in the Google Books Partner Program, and Google's so-called "library partners" who contribute their collections through the Library Project. Oh, and they’ve also managed to dig through the archives of various magazine publishers. Because who doesn't want to revisit the faded glory of yesterday's periodicals?
The whole thing kicked off as Google Print at the Frankfurt Book Fair in October 2004, a rather quaint introduction for something so… relentless. Then came the Library Project in December 2004, which sounds less like a partnership and more like a pact with the devil for some.
Now, Google Books is lauded by some as a beacon of access, a potential reservoir of human knowledge so immense it could redefine our understanding of information. They whisper about the democratization of knowledge, as if handing over the keys to a monolithic corporation is the same as liberating it. Others, naturally, are less thrilled, pointing fingers at potential copyright violations and the frankly abysmal editing that follows OCR. Because, let's be honest, nothing says "knowledge" like garbled text and misplaced punctuation.
As of October 2019, Google, in its infinite wisdom, announced they’d scanned over 40 million titles, a rather tidy sum to celebrate 15 years of this grand experiment. Back in 2010, they estimated there were roughly 130 million distinct titles out there in the world, and their audacious goal was to scan them all. A noble pursuit, perhaps, or a terrifying one, depending on your perspective. The pace of scanning, particularly in American academic libraries, has slowed since the early days, perhaps realizing the futility of the task or, more likely, after being battered by litigation. This particular legal entanglement, a class-action lawsuit, was a significant hurdle, threatening to redefine how orphan works are handled in the United States. And in a twist that might surprise even Google, a 2023 study found that all this digitization actually increased sales of physical books. Go figure.
Details
The results from your little book adventures can appear in two places: the all-encompassing Google Search or the dedicated Google Books website. It’s a bit like finding a needle in a haystack, except the haystack is made of paper and digitized by machines.
When you search, Google Books will show you what it can. If a book is old and out of copyright, or if the publisher has given its blessing, you might get to see the whole thing. A real treat. For anything else, you get "snippets." Little tantalizing glimpses of text around your search term, highlighted in yellow, like a digital breadcrumb trail leading you… somewhere. It’s enough to make you curious, but rarely enough to satisfy.
Google uses four levels of access, each more soul-crushing than the last (a code sketch follows the list):
- Full view: This is for the public domain books, the ones whose authors have long since shuffled off this mortal coil. You can download them. For free. A rare act of generosity, I suppose. In-print books acquired through the Partner Program can be full view, but it’s as rare as a sincere compliment from a politician.
- Preview: For in-print books where permission has been granted, you get a "preview." The number of pages you can see is carefully controlled, often based on your digital footprint. Publishers get to decide how much you see, usually a percentage. You can’t copy, download, or print it. And there’s a watermark, a constant reminder of your limited access: "Copyrighted material." It’s like being shown a buffet but only allowed to sniff the food.
- Snippet view: This is where things get truly frustrating. When Google doesn’t have permission, or can’t find the owner, you get a few lines of text. Just enough to pique your interest, not enough to be useful. If your search term appears frequently, they’ll only show you three snippets. Three. It’s a calculated cruelty. And for some reference books, like dictionaries, they won’t even show snippets, because, apparently, even a taste would ruin the market. Google claims this is all perfectly legal under copyright law. I'm sure the copyright holders agree.
- No preview: Sometimes, Google doesn’t even have the text. In these cases, you get the metadata – the title, author, publisher, ISBN, all the boring stuff. It’s essentially an online card catalog, a digital tombstone for books they couldn't quite resurrect.
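If you would rather interrogate these tiers programmatically, Google's public Books API (the v1 volumes endpoint) exposes a viewability field that roughly maps onto them. A minimal sketch, with the caveat that the field names and values shown (ALL_PAGES, PARTIAL, NO_PAGES, UNKNOWN) are as I recall the API; verify against the current documentation before leaning on them:

```python
import json
import urllib.parse
import urllib.request

# Query the public Google Books Volumes API (v1) for a title and report
# each result's viewability level, which roughly maps onto the four
# access tiers above (ALL_PAGES ~ full view, PARTIAL ~ preview,
# NO_PAGES ~ snippet or no preview).
query = urllib.parse.quote("intitle:frankenstein")
url = f"https://www.googleapis.com/books/v1/volumes?q={query}&maxResults=5"

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for item in data.get("items", []):
    title = item.get("volumeInfo", {}).get("title", "<untitled>")
    viewability = item.get("accessInfo", {}).get("viewability", "UNKNOWN")
    print(f"{title!r}: {viewability}")
```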
In response to all the grumbling from publishers and authors – the American Association of Publishers and the Authors Guild, naturally – Google introduced an opt-out policy in August 2005. Copyright owners could, theoretically, tell Google not to scan their books. They even paused scanning for a brief period to give everyone a chance to object. The choices were presented as participation in the Partner Program, allowing snippets, or opting out entirely. A generous offer, considering.
Most of what’s scanned, it turns out, is out of print. So, in a way, Google is preserving things. Just not necessarily the things people want to read.
Then there’s the "Partner Program." Publishers and authors submit their books – digital or physical – and Google makes them available for preview. Publishers can set the preview percentage, but the minimum is 20%. They can even make the whole book viewable, or allow PDF downloads. Books can also be sold through Google Play. This, at least, doesn't involve copyright hand-wringing because it's all done with consent. Publishers can, of course, withdraw at any time. It’s a digital leash, really.
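The arithmetic of that leash is simple enough to sketch. The 20% floor comes from the text above; the rounding rule is my own assumption:

```python
def preview_pages(total_pages: int, requested_percent: int) -> int:
    """Pages exposed for a Partner Program book, assuming Google clamps
    the publisher's choice to the stated 20% minimum and rounds down."""
    percent = max(requested_percent, 20)  # the program's stated floor
    return total_pages * percent // 100

print(preview_pages(320, 10))  # publisher asked for 10%; the floor gives 64 pages
```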
Oddly enough, for many books, Google Books still displays original page numbers. But Tim Parks, in a rather scathing piece in The New York Review of Books in 2014, noted that they’ve stopped doing this for newer publications. His theory? To force people who need footnotes to buy the actual paper editions. A cynical, but entirely plausible, business move.
Scanning of books
The whole endeavor began in 2002, under the cloak of secrecy, as Project Ocean. Larry Page, one of Google's founders, was apparently always fascinated by the idea of digitizing books. When he and Marissa Mayer started tinkering in 2002, it took them a glacial 40 minutes to scan a single 300-page book. Fast forward a bit, and the technology improved to the point where scanning operators could churn out 6,000 pages an hour. Progress, I suppose.
Google set up dedicated scanning centers, where books were trucked in like so much raw material. These stations could process 1,000 pages per hour. The books were nestled into custom-built cradles that held the spine steady while lights and optical instruments did their work on the open pages. Cameras captured the images, and a range finder using LIDAR mapped the page's curvature. A human, bless their soul, would manually turn the pages, using a foot pedal to trigger the scans. This ingenious setup avoided the need to flatten pages or align them perfectly, protecting fragile collections from the indignity of over-handling.
The raw images then underwent a three-stage processing pipeline: de-warping algorithms used the LIDAR data to flatten the curved pages. Then, optical character recognition software did its best to turn images into text. Finally, another layer of algorithms tried to extract page numbers, footnotes, and illustrations. It’s a digital Frankenstein, pieced together from scans and algorithms.
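Google's actual pipeline is proprietary, but the three stages are easy to caricature in a few dozen lines. A sketch under loose assumptions: the de-warp here shifts pixel sampling by a fake depth map rather than building a real 3D model from LIDAR data, and Tesseract (via pytesseract) stands in for Google's OCR engine:

```python
import re

import cv2  # OpenCV, for the geometric correction step
import numpy as np
import pytesseract  # thin wrapper around the Tesseract OCR engine

def dewarp(page: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Stage one: crudely flatten a curved page using a per-pixel depth map.

    Google's version builds a full 3D model from range data; here we just
    offset each pixel's vertical sampling position in proportion to the
    local depth, which is enough to illustrate the remap idea.
    """
    h, w = page.shape[:2]
    map_x, map_y = np.meshgrid(np.arange(w, dtype=np.float32),
                               np.arange(h, dtype=np.float32))
    map_y += depth.astype(np.float32)  # sample from rows shifted by curvature
    return cv2.remap(page, map_x, map_y, interpolation=cv2.INTER_LINEAR)

def extract_text(page: np.ndarray) -> str:
    """Stage two: OCR the flattened image."""
    return pytesseract.image_to_string(page)

def extract_structure(text: str) -> dict:
    """Stage three: fish out page numbers and footnote-like lines."""
    page_no = re.search(r"^\s*(\d{1,4})\s*$", text, flags=re.MULTILINE)
    footnotes = re.findall(r"^\s*\d+\.\s+.+$", text, flags=re.MULTILINE)
    return {"page_number": page_no.group(1) if page_no else None,
            "footnotes": footnotes}

# Toy usage: a blank white "page" with zero curvature.
page = np.full((600, 400), 255, dtype=np.uint8)
depth = np.zeros((600, 400))
structure = extract_structure(extract_text(dewarp(page, depth)))
```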
Many of these books are scanned using a specialized Elphel 323 camera, processing 1,000 pages per hour. Google even patented a system that uses two cameras and infrared light to automatically correct page curvature. By creating a 3D model and then "de-warping" it, they could present flat pages without resorting to destructive methods like unbinding the books or using glass plates. Efficiency is key, apparently.
Google opted to ditch color information in favor of better spatial resolution, figuring most old books didn’t have much color anyway. Algorithms distinguished text from illustrations, and then OCR did its magic on the text. They also poured resources into compression techniques, aiming for high image quality with minimal file sizes, catering to the internet users of the world with their meager bandwidth.
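The trade-off is easy to demonstrate with Pillow. A toy comparison; CCITT Group 4 is my stand-in for a bilevel codec, not necessarily what Google used, and the byte counts describe this blank synthetic page only:

```python
from io import BytesIO
from PIL import Image

def encoded_size(image: Image.Image, **save_kwargs) -> int:
    """Encode an image in memory and return the resulting byte count."""
    buf = BytesIO()
    image.save(buf, **save_kwargs)
    return buf.tell()

# A blank A4-ish page at roughly 300 dpi.
img = Image.new("RGB", (2480, 3508), "white")

# Throw away color, threshold to 1 bit per pixel, keep the resolution.
bilevel = img.convert("L").point(lambda v: 255 if v > 127 else 0).convert("1")

color_png = encoded_size(img, format="PNG")
bilevel_g4 = encoded_size(bilevel, format="TIFF", compression="group4")
print(f"24-bit PNG: {color_png} bytes; 1-bit Group 4 TIFF: {bilevel_g4} bytes")
```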
Website functionality
Each digitized work gets its own overview page. This page is a jumble of information pulled from the book: publishing details, a word frequency map, the table of contents. It’s also peppered with secondary material: summaries, reader reviews (though not on the mobile version, of course), and links to other related texts. It's an attempt to create a context, a digital ecosystem around each book.
Users with a Google account can interact with this. They can export bibliographic data and citations in standard formats, write their own reviews, and add books to their personal libraries for tagging and organization. It’s a curated experience, drawing from users, third-party sites like Goodreads, and even the authors and publishers themselves.
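BibTeX is one of those standard formats, and generating an entry from overview-page metadata is mechanical. A sketch; the field mapping here is mine, not Google's exporter:

```python
def to_bibtex(key: str, meta: dict) -> str:
    """Render book metadata as a BibTeX @book entry."""
    fields = {"title": meta.get("title"),
              "author": " and ".join(meta.get("authors", [])),
              "publisher": meta.get("publisher"),
              "year": meta.get("year"),
              "isbn": meta.get("isbn")}
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items() if v)
    return f"@book{{{key},\n{body}\n}}"

print(to_bibtex("shelley1818", {
    "title": "Frankenstein; or, The Modern Prometheus",
    "authors": ["Mary Shelley"],
    "publisher": "Lackington, Hughes, Harding, Mavor & Jones",
    "year": 1818,
}))
```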
To encourage authors to participate, Google has added a few features. Authors can let visitors download their ebooks for free, or set their own prices. They can even change prices on a whim, offering discounts. Adding an ISBN, LCCN, or OCLC number can update the book's URL, and authors can even set a specific page as the anchor for the link. It’s all about discoverability, apparently.
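The identifier-based links follow a simple pattern. A hedged sketch: the ?vid= query form is one I have seen in the wild, but treat the exact syntax, and the LCCN/OCLC variants, as assumptions:

```python
def books_url(identifier: str, scheme: str = "ISBN") -> str:
    """Build a Google Books link keyed to an identifier. The ?vid= form
    is assumed; scheme may also be LCCN or OCLC (also assumed)."""
    return f"https://books.google.com/books?vid={scheme}:{identifier}"

print(books_url("9780141439471"))  # a Penguin edition of Frankenstein
```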
Ngram Viewer
Main article: Google Ngram Viewer
This is a rather peculiar offshoot of Google Books, a service that graphs the frequency of word usage across their entire collection. It’s supposed to be a window into human culture, a way to track linguistic evolution. Historians and linguists are apparently thrilled. Others point out its inherent flaws, particularly the errors in the metadata it relies upon. It’s a fascinating tool, buried under a mountain of potential inaccuracies.
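The quantity it plots is unglamorous: for each year, occurrences of a term divided by the total tokens published that year. A toy reconstruction over an invented two-entry corpus; the real corpus runs to hundreds of billions of words and applies smoothing:

```python
def ngram_share(word: str, texts: dict[int, str]) -> dict[int, float]:
    """Fraction of each year's tokens matching `word`: the same quantity
    the Ngram Viewer graphs, minus the scale and the smoothing."""
    shares = {}
    for year, text in texts.items():
        tokens = text.lower().split()
        shares[year] = tokens.count(word.lower()) / len(tokens)
    return shares

corpus = {  # two invented "books", one per year
    1900: "the telephone is a marvel the telegraph rather less so",
    1950: "the television is a marvel the telephone is merely routine",
}
print(ngram_share("telephone", corpus))  # {1900: 0.1, 1950: 0.1}
```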
Content issues and criticism
The grand project, despite its lofty aims of preserving forgotten works, faces scrutiny for its own flaws. The data, riddled with errors from the OCR process, remains largely uncorrected. It’s like building a monument out of crumbling bricks.
Scanning errors
The scanning process, as you might expect, is not perfect. Pages can be unreadable, upside down, or out of order. Scholars have reported crumpled pages, accidental thumbs and fingers appearing in the scans, and generally blurry images. Google’s official statement on the matter is a masterpiece of corporate speak:
The digitization at the most basic level is based on page images of the physical books. To make this book available as an ePub formatted file we have taken those page images and extracted the text using Optical Character Recognition (or OCR for short) technology. The extraction of text from page images is a difficult engineering task. Smudges on the physical books' pages, fancy fonts, old fonts, torn pages, etc. can all lead to errors in the extracted text. Imperfect OCR is only the first challenge in the ultimate goal of moving from collections of page images to extracted-text based books. Our computer algorithms also have to automatically determine the structure of the book (what are the headers and footers, where images are placed, whether text is verse or prose, and so forth).
Getting this right allows us to render the book in a way that follows the format of the original book. Despite our best efforts you may see spelling mistakes, garbage characters, extraneous images, or missing pages in this book. Based on our estimates, these errors should not prevent you from enjoying the content of the book. The technical challenges of automatically constructing a perfect book are daunting, but we continue to make enhancements to our OCR and book structure extraction technologies.
In 2009, Google announced they’d start using reCAPTCHA to help fix OCR errors. It’s a noble, if somewhat desperate, effort. It can only fix badly scanned words, though; it won't fix pages that are upside down or obscured. The errors have, at least, inspired some art – collections of anomalous pages and a rather popular Tumblr blog. So, there's that.
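The mechanism is plain crowd arithmetic: pair a word the OCR engine flagged with a control word whose answer is known, and once enough users who got the control right agree on the unknown word, accept their reading. A sketch with invented thresholds; the production heuristics were never published in this form:

```python
from collections import Counter

def accept_transcription(votes: list[str], min_votes: int = 3,
                         agreement: float = 0.6) -> str | None:
    """Accept the majority reading of an OCR-flagged word once enough
    control-validated users agree. Both thresholds are invented."""
    if len(votes) < min_votes:
        return None  # not enough eyes on the word yet
    word, count = Counter(votes).most_common(1)[0]
    return word if count / len(votes) >= agreement else None

print(accept_transcription(["morrow", "morrow", "marrow", "morrow"]))  # morrow
```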
Errors in metadata
Scholars have been vocal about the rampant errors in the metadata on Google Books. Misattributed authors, incorrect publication dates – it’s a mess. Geoffrey Nunberg, a linguist, found that searching for books containing the word "internet" published before 1950 yielded an absurd 527 results. Woody Allen apparently appeared in 325 books published before he was even born. Google’s excuse? Blame the contractors.
Other gems include publication dates before the author's birth (Charles Dickens, anyone?), incorrect subject classifications (Moby Dick under "computers," Mae West under "religion"), conflicting classifications (Whitman's Leaves of Grass as both "fiction" and "nonfiction"), misspellings of titles and authors, and metadata for one book incorrectly attached to a completely different one. It's a digital junkyard of information.
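Most of these howlers would fall to the most trivial consistency checks. A sketch; the record fields and thresholds are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class BookRecord:
    title: str
    author: str
    author_born: int  # year
    published: int    # year

def sanity_check(rec: BookRecord) -> list[str]:
    """Flag metadata that cannot possibly be right."""
    problems = []
    if rec.published < rec.author_born:
        problems.append("published before the author was born")
    if rec.published < 1450:
        problems.append("predates movable type in Europe")
    return problems

ghost = BookRecord("A Christmas Carol", "Charles Dickens", 1812, 1805)
print(sanity_check(ghost))  # ['published before the author was born']
```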
A review of 400 randomly selected records found a staggering 36% contained metadata errors. That's a higher error rate than you'd find in a typical library catalog. The authors of the study concluded that these errors were "major," impacting "findability." Nunberg himself noted that Google was aware of these issues and supposedly working on fixes back in 2009.
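For the statistically inclined, 36% of 400 is not a flimsy number; a quick normal-approximation check on the sampling error (standard textbook formula, my arithmetic):

```python
import math

# Observed error rate p in a sample of n records; 95% confidence interval
# via the normal approximation to the binomial.
p, n = 0.36, 400
se = math.sqrt(p * (1 - p) / n)         # standard error, about 0.024
lo, hi = p - 1.96 * se, p + 1.96 * se
print(f"95% CI: {lo:.1%} to {hi:.1%}")  # roughly 31.3% to 40.7%
```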
Language issues
Some European intellectuals have voiced concerns about linguistic imperialism. They argue that the overwhelming focus on English in Google Books will lead to a skewed representation of languages in the digital realm, potentially shaping future scholarship. Jean-Noël Jeanneney, former president of the Bibliothèque nationale de France, is among the most vocal critics.
Google Books versus Google Scholar
While Google Books has digitized a lot of old journals, it lacks the metadata needed to identify individual articles. This gap led the creators of Google Scholar to start their own digitization program. Apparently, even Google can’t do everything perfectly.
Library partners
The Google Books Library Project is designed to scan and make searchable the collections of major research libraries. You get bibliographic information and, often, text snippets. If a book is out of copyright and in the public domain, you can read or download it entirely. A rare moment of genuine public service.
In-copyright books scanned through this project are relegated to snippet view. Google admits these scans aren't always of high enough quality for sale on Google Play. And due to "technical constraints," they won't replace these lower-quality scans with better versions if publishers provide them. It’s a rather convenient excuse.
This whole endeavor is the subject of the infamous Authors Guild v. Google lawsuit, which Google, after a decade of legal sparring, eventually won.
Copyright holders can reclaim their scanned books, making them available for preview or full view, or request that their texts not be searched. It’s a constant negotiation of rights in the digital age.
The number of institutions involved has grown considerably.
Initial partners
- Harvard University, Harvard University Library: A pilot program in 2005 aimed to increase online access to Harvard's 15.8 million volumes. While physical access is usually restricted, this project was meant to open it up to the world.
- University of Michigan, University of Michigan Library: By March 2012, they had scanned 5.5 million volumes.
- New York Public Library: This partnership focused on public domain books, making them fully searchable and browsable online, accessible from both the NYPL website and Google.
- University of Oxford, Bodleian Library: Another major academic institution contributing to the digital deluge.
- Stanford University, Stanford University Libraries (SULAIR): Joining the ranks of those willing to hand over their collections for digitization.
Additional partners
The list grew, as these things tend to do:
- Austrian National Library
- Bavarian State Library
- Bibliothèque municipale de Lyon
- Big Ten Academic Alliance
- Columbia University, Columbia University Library System
- Complutense University of Madrid
- Cornell University, Cornell University Library
- Ghent University, Ghent University Library/Boekentoren
- Keio University, Keio Media Centers (Libraries)
- National Library of Catalonia
- Princeton University, Princeton University Library
- University of California, California Digital Library
- University of Lausanne, Cantonal and University Library of Lausanne
- University of Mysore, Mysore University Library: This partnership involved digitizing 800,000 texts, including ancient palm-leaf manuscripts.
- University of Texas at Austin, University of Texas Libraries: Focused on their extensive Latin American collection.
- University of Virginia, University of Virginia Library
- University of Wisconsin–Madison, University of Wisconsin Libraries: Scanned around 600,000 volumes by March 2012.
History
- 2002: The "secret 'books' project" officially begins. The idea, however, dates back to Google founders Sergey Brin and Larry Page when they were graduate students. Their initial vision was for a "web crawler" to index books, analyzing their connections and relevance through citations. Page, upon learning that the University of Michigan's digitization efforts would take 1,000 years, confidently declared Google could do it in six. Ambitious, or delusional.
- 2003: The team focuses on developing high-speed scanning and software to handle peculiar typefaces and oddities.
- December 2004: Google Print announces the Library Project, partnering with major libraries like University of Michigan, Harvard, Stanford, Oxford, and the New York Public Library. The plan to digitize copyrighted works immediately sparked controversy.
- September–October 2005: Lawsuits fly. The Authors Guild files a class-action suit, followed by a civil suit from five major publishers and the Association of American Publishers. Accusations of copyright infringement and failure to compensate authors and publishers abound.
- November 2005: The service is rebranded from Google Print to Google Book Search. The Partner Program becomes the Google Books Partner Program, and the library initiative becomes the Google Books Library Project. A rebranding, not a resolution.
- 2006: Google adds a "download a PDF" button for public domain books and introduces new browsing interfaces.
- August 2006: The University of California System joins, pledging to scan a portion of its 34 million volumes.
- September 2006: The Complutense University of Madrid becomes the first Spanish-language library partner.
- October 2006: The University of Wisconsin–Madison and the Wisconsin Historical Society Library join, adding 7.2 million holdings to the digital collection.
- November 2006: The University of Virginia joins, contributing its vast collection of over five million volumes and millions of manuscripts.
- January 2007: University of Texas at Austin announces its participation, aiming to digitize at least one million volumes.
- March 2007: The Bavarian State Library partners with Google to scan over a million German-language works.
- May 2007: The Cantonal and University Library of Lausanne and the Boekentoren Library of Ghent University announce their collaborations. Mysore University also announces a massive digitization project.
- June 2007: The Committee on Institutional Cooperation (later Big Ten Academic Alliance) pledges to scan 10 million books.
- July 2007: Keio University becomes Google's first Japanese library partner.
- August 2007: Cornell University Library agrees to have up to 500,000 items digitized.
- September 2007: Google introduces a feature for sharing public domain book snippets and launches "My Library," allowing users to curate personal collections.
- December 2007: Columbia University joins the digitization effort.
- May 2008: Microsoft winds down its own scanning project, Live Search Books.
- October 2008: A settlement is reached between Google and the publishing industry, promising to make millions of books available.
- October 2008: The HathiTrust Digital Library is launched by former Google partner libraries to archive scanned materials.
- November 2008: Google announces it has scanned 7 million books, with 1 million in full preview and 1 million fully downloadable public domain works.
- December 2008: Magazines are added to Google Books, including titles like New York Magazine and Popular Mechanics.
- February 2009: A mobile version of Google Book Search is launched.
- May 2009: Google signals its intention to launch a program allowing publishers to sell digital versions of their newest books directly to consumers.
- December 2009: A French court halts the scanning of copyrighted books, marking a significant legal setback for Google.
- April 2010: Visual artists, excluded from previous lawsuits, file their own class-action suit, broadening the scrutiny beyond just book texts.
- May 2010: Reports emerge of a planned digital bookstore called Google Editions.
- June 2010: Google surpasses 12 million books scanned.
- August 2010: Google declares its intention to scan all 129,864,880 known books within a decade. A truly Herculean, or perhaps hubristic, task.
- December 2010: Google eBooks (Google Editions) launches in the US. The Ngram Viewer is also released, graphing word usage trends.
- March 2011: A federal judge rejects the proposed settlement between Google and the publishing industry.
- March 2012: Google passes 20 million books scanned. A new settlement with publishers is reached.
- January 2013: The documentary Google and the World Brain premieres at the Sundance Film Festival.
- November 2013: US District Judge Denny Chin rules in favor of Google in the Authors Guild v. Google case, citing fair use. The authors vow to appeal.
- October 2015: An appeals court upholds the ruling, declaring Google did not violate copyright law. Google has reportedly scanned over 25 million books by this point.
- April 2016: The US Supreme Court declines to hear the Authors Guild's appeal, solidifying Google's right to scan library books and display snippets.
Status
Google has been notoriously tight-lipped about the future of Google Books. Scanning operations have been slowing down since at least 2012, according to librarians at partner institutions. While Google’s timeline page is conspicuously silent after 2007, and its blog merged with the Google Search blog, the project hasn’t officially been declared dead.
Despite Google's victory in the decade-long legal battle, The Atlantic reported in 2017 that the company had "all but shut down its scanning operation." That same year, Wired noted only a handful of employees were still working on the project, with new books being scanned at a significantly reduced rate. The legal fight, it seems, had drained Google's ambition.
Legal issues
Further information: Authors Guild, Inc. v. Google, Inc.
The indiscriminate digitization of library books, regardless of copyright status, led to a cascade of lawsuits. By the end of 2008, Google had scanned over seven million books, with only about a million in the public domain. The rest were in copyright, some in print, many out of print. In 2005, authors and publishers sued for copyright infringement. Google’s defense? It was preserving "orphaned works" – books whose copyright holders couldn't be found. A noble cause, or a convenient excuse?
The Authors Guild and the Association of American Publishers launched their suits in 2005, citing "massive copyright infringement." Google countered that its project was a fair use, the digital equivalent of a card catalog. A settlement was proposed, but it faced widespread criticism for antitrust, privacy, and inadequacy issues. It was rejected, though the publishers eventually settled. The Authors Guild pressed on, and after a protracted legal battle, the courts ultimately sided with Google, citing fair use. This case set a significant precedent for similar scanning projects and clarified copyright law in the digital age.
Other lawsuits followed. A German case was withdrawn in 2006. In France, publisher La Martinière and Éditions du Seuil sued Google France. In 2009, a Paris court awarded damages and ordered Google to pay daily fines until it removed copyrighted books. The court ruled Google had violated copyright laws by reproducing and making books accessible without permission. Google appealed. The Syndicat National de l'Edition claimed Google had scanned around 100,000 French copyrighted works.
In China, author Mian Mian filed a civil lawsuit for $8,900 in 2009 over her novel, Acid Lovers. The China Written Works Copyright Society (CWWCS) accused Google of scanning 18,000 books by 570 Chinese writers without authorization. Google agreed to provide a list of scanned Chinese books but denied infringement.
Thomas Rubin, Microsoft's associate general counsel, accused Google of copyright violations in 2007, criticizing their policy of copying works until notified to stop.
Even Google's handling of public domain works raised concerns, particularly its use of digital watermarking. Some public domain works, such as those produced by the U.S. federal government, were treated as if still under copyright, with anything published after 1922 locked by default.
Since at least 2014, Google has allowed authors and publishers to remove book previews upon request. A small concession, perhaps, in a vast digital ocean.
Similar projects
- Project Gutenberg: A volunteer effort dating back to 1971, dedicated to digitizing and distributing cultural works. It's the oldest digital library, boasting over 50,000 items by October 2015.
- Internet Archive: A non-profit digitizing thousands of books daily and mirroring content from other sources. Its sister project, Open Library, lends ebooks from a collection of 80,000 titles.
- HathiTrust: Launched in 2008, it preserves and provides access to materials scanned by Google, the Internet Archive, and partner institutions. It holds millions of volumes, with a significant portion in the public domain.
- ACLS Humanities E-Book: A subscription-based collection of scholarly works in the humanities.
- Live Search Books: Microsoft's ill-fated scanning project, launched in 2006 with 300,000 books, was abandoned in 2008 and its content made available on the Internet Archive.
- National Digital Library of India: An Indian government initiative to integrate various digital libraries.
- Europeana: A portal linking to millions of digital objects from European history, including books, maps, and audiovisual materials.
- Gallica: The French National Library's digital library, offering millions of digitized documents, mostly in French.
- Wikisource: A collaborative digital library of source texts.
- Runivers: A platform for historical documents and books.
See also
- A9.com: Amazon.com's book search. A competitor, now largely defunct.
- Book Rights Registry: An organization involved in managing book rights.
- Digital library: The overarching concept.
- List of digital library projects: A broader overview.
- Universal library: A concept of a comprehensive library.
- National Electronic Library: Another initiative for digital access.