CC Search Developer Notes and Reflection

Written by creativecommons | Published 2017/02/07

by Liza Daly

In July 2016 I was brought on to the Creative Commons team to research and build a proof-of-concept for CC Search: a front door to the universe of openly licensed content. As part of this engagement, I would need to recommend a technical approach, ingest millions of works, and build a robust working prototype that could serve to generate public discussion and point the direction for future enhancements and, ultimately, a fully realized vision.

CC estimates that there are 1.1 billion works in the Commons, inclusive of both CC-licensed material and the public domain. Of those, about half are thought to be images, supplied by artists who have chosen CC licenses for their work and by cultural institutions that have chosen to release public domain materials without restrictions. The remaining 500 million or so works comprise other creative media such as video and music, as well as, increasingly, CC-licensed research, educational material, and wholly new kinds of works such as 3D modeling data.

Project scope

The term of my engagement was seven months, including scoping, research, product build, and public launch. CC is a small organization with limited resources. I received generous assistance from Rob Myers, CC’s developer and technical lead, and strategic input from many in the organization, but otherwise this was to be a development team of one.

Despite that constraint, the project would not be a realistic proof of concept if it did not adequately reflect the scale of the overall ambition. That meant we would need to operate on enough content to mirror the breadth and diversity of the Commons, while at the same time avoiding unnecessary overhead that would impair our ability to iterate quickly.

We settled on a goal to represent 1% of the known Commons, but had a choice about which 1% to select:

  • Choose a horizontal slice across media types, to explore a search interface that could handle images, audio, video, and text.
  • Choose a vertical slice down one media type, to fully explore a purpose-built interface that represented one type but many providers.

Images comprise half of the total Commons yet represent a diverse set of material — we would still need to answer what it meant to search (for example) both contemporary photographs and images of museum holdings. The uncertain future of Flickr, which alone holds more than 350 million openly licensed images, adds urgency. Finally, images lend themselves naturally to exploration via a web interface; this would be less true for some other media types. For these reasons, we decided that the first public iteration of CC Search would be made up of approximately 10 million images, and only images.

Content partner selection

Having chosen images, I looked at potential content partners with large open collections and available APIs for discovery and acquisition. Our short list included:

Provider name               Est. number of open works
New York Public Library                       180,000
Metropolitan Museum of Art                    200,000
Rijksmuseum                                   500,000
500px                                         800,000
Internet Archive                            1,000,000
DPLA                                        3,000,000
Europeana                                   5,000,000
Wikimedia Commons                          35,000,000
Flickr                                    380,000,000

(These numbers are only first-order approximations, and there’s plenty of overlap: Europeana includes Rijksmuseum and many of the public domain works that are part of Flickr Commons; many photographers post to both 500px and Flickr.)

There were several criteria for consideration:

  • How easy is the API to use? (Or are bulk sources available?)
  • What is the overall quality of the imagery?
  • How reliably can we identify openly licensed images where the provider includes both openly and traditionally licensed material?

The last is critically important to this project. While we cannot make legal assertions about material not under our control, as an official Creative Commons project, we should endeavor to ensure that what we present for re-use is, indeed, openly licensed. (This guideline led us to reject other commercial photography providers which seem to repost or recycle copyrighted material without much oversight.)

We also needed to consider the relative representation of each provider.

Clearly, a “representative sample” would have to be disproportionately sourced from Flickr, but to support our goal of presenting a diversity of material, we’d also need to ensure that we designed the interface to surface non-photographs or allow users to drill down into specific providers.
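
To make that skew concrete, the short-list estimates can be turned into shares of the total. This is only a rough sketch: the figures are the first-order approximations from the short list, overlap between providers is ignored, and the helper names `provider_shares` and `proportional_sample` are illustrative rather than anything taken from the CC Search codebase.

```python
# Estimated open works per provider, copied from the short list.
# First-order approximations only; overlap between providers is ignored.
ESTIMATED_OPEN_WORKS = {
    "New York Public Library": 180_000,
    "Metropolitan Museum of Art": 200_000,
    "Rijksmuseum": 500_000,
    "500px": 800_000,
    "Internet Archive": 1_000_000,
    "DPLA": 3_000_000,
    "Europeana": 5_000_000,
    "Wikimedia Commons": 35_000_000,
    "Flickr": 380_000_000,
}


def provider_shares(counts):
    """Return each provider's fraction of the combined total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def proportional_sample(counts, sample_size):
    """Split sample_size across providers in proportion to their counts."""
    shares = provider_shares(counts)
    return {name: round(share * sample_size) for name, share in shares.items()}


if __name__ == "__main__":
    shares = provider_shares(ESTIMATED_OPEN_WORKS)
    # Print providers from largest to smallest share.
    for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:28s} {share:6.1%}")
```

On these numbers Flickr accounts for roughly 89% of the short list, so a strictly proportional sample of 10 million images would leave only a few thousand images for each of the smallest collections. That is the practical reason the interface has to help surface them.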

Unfortunately, we had to defer a few providers we might’ve liked to include:

  • DPLA represents a rich resource of important material, but does not provide a mechanism to authoritatively identify the license of the material. It’s possible to browse open collections manually, but not programmatically via the API. Many libraries provide access to resources under ad hoc restrictions that don’t neatly fall under Creative Commons or public domain designations.
  • Wikimedia Commons represents a large and rich corpus of material, but rights information is not currently well-structured. The Wikimedia Foundation recently announced that a $3 million grant from the Sloan Foundation will be applied to work on this problem, but that work has just begun.

In the end, we selected these providers as representing the best “bang for the buck” as well as hewing closely to the overall representation of the Commons:

Provider name               Approx. number of works
500px                                        50,000
Flickr                                    9,000,000
New York Public Library                     168,000
Rijksmuseum                                  59,000
Metropolitan Museum of Art                  200,000

(Exact figures may vary from time to time.)

To surface the non-Flickr works, we developed filtering tools that let you drill down either by general work type — photograph or cultural works — or by specific provider(s). This is an imperfect metric: there are cultural works on Flickr and photographs held at museums, but this was the simplest way to slice across the problem.
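
One way to picture this kind of filter is as a provider list attached to the search request. The sketch below builds an Elasticsearch-style request body as a plain dictionary. It rests on several assumptions: the grouping of providers into work types is inferred from the description above, and the field names (`provider`, `title`, `creator`, `tags`), the provider keys, and the function names are all hypothetical, not taken from the actual CC Search code.

```python
# Hypothetical grouping of providers into the two general work types.
# Deliberately coarse: it ignores cultural works on Flickr and
# photographs held by museums.
WORK_TYPE_PROVIDERS = {
    "photos": {"flickr", "500px"},
    "cultural": {"nypl", "rijksmuseum", "met"},
}


def resolve_providers(work_types=(), providers=()):
    """Combine work-type selections and explicit providers into one sorted list."""
    selected = set(providers)
    for work_type in work_types:
        selected |= WORK_TYPE_PROVIDERS[work_type]
    return sorted(selected)


def build_search_body(text, work_types=(), providers=()):
    """Build an Elasticsearch-style request body for a filtered text search."""
    bool_query = {
        "must": [
            {"multi_match": {"query": text, "fields": ["title", "creator", "tags"]}}
        ]
    }
    selected = resolve_providers(work_types, providers)
    if selected:
        # A non-scoring filter restricts results to the chosen providers.
        bool_query["filter"] = [{"terms": {"provider": selected}}]
    return {"query": {"bool": bool_query}}


if __name__ == "__main__":
    print(build_search_body("bridge", work_types=["cultural"]))
```

With no work type or provider selected, the body carries no filter at all, so the search runs across every provider.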

Technical considerations

Hosting and service architecture

CC Search is meant to make material more discoverable regardless of where it is hosted. For this reason (and for obvious cost-saving objectives), we decided to host only image metadata (title, creator name, and any known tags or descriptions) and to link directly to the provider for image display and download. A consequence of this is that CC Search only includes images that are currently available on the web; CC is not collecting or archiving any images itself.
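
As a rough illustration of what "metadata only" means in practice, a stored record might look something like the sketch below. The class name, the field names, and the example.org URLs are all assumptions made for illustration; the actual CC Search schema may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ImageRecord:
    """Metadata for one image; the image file itself stays with the provider."""

    title: str
    provider: str
    license: str
    landing_url: str  # page on the provider's site (hypothetical field)
    image_url: str    # provider-hosted image file, linked rather than copied
    creator: Optional[str] = None
    description: Optional[str] = None
    tags: List[str] = field(default_factory=list)

    def to_document(self) -> dict:
        """Return a plain dict suitable for storing or indexing."""
        return asdict(self)


# Example with made-up values: no image bytes are stored anywhere.
record = ImageRecord(
    title="Harbor at dusk",
    provider="example-museum",
    license="CC0",
    landing_url="https://collection.example.org/objects/12345",
    image_url="https://images.example.org/12345.jpg",
    creator="Unknown",
    tags=["harbor", "ships"],
)

if __name__ == "__main__":
    print(record.to_document())
```

Because the record holds only links, an image that disappears from the provider's site also disappears, in effect, from search results.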

As the overall architecture is expected to evolve (and grow considerably!), it was natural to choose a cloud hosting infrastructure. I did a cost analysis on both Google Cloud and AWS, and they came out more or less the same. I went with AWS primarily due to its larger suite of offerings and the likelihood that future developers would be more familiar with it than Google’s cloud. At the start of the project I estimated a total hosting cost of $1,000/mo, well within the existing project budget. (After the prototype was built and we had a better understanding of the scale and components needed, I commissioned another round of estimates prepared by an experienced IT administrator. Those estimates were closer to $1,400/mo, which I consider a relatively benign growth in scope and which is still within budget.)

Software architecture meta-goals

My role in CC Search is temporary: I am writing code to either be thrown away or evolved by other engineers. While this project is meant to be a prototype, we all know that the MVP tends to become production code. I would be delighted if the entire codebase were eventually replaced, having served its purpose in promoting the initiative and surfacing use cases, but I needed to operate under the assumption that this code would live on, potentially for years. For these reasons, I operated under these self-imposed guidelines:

  • Obvious over obscure: if there’s one well-known tool or library for feature X, use that over a potentially better but lesser-known one.
  • Write a comprehensive test suite, even under changing conditions. I believe the best technical documentation is a good set of test cases that enumerate everything the app is supposed to do. Good test coverage also allows new developers to make changes without having to first fully understand the entire application — I configured the app for continuous integration testing, so they’ll know right away if they broke something.
  • Provide a good foundation for security and code hygiene. CC likes to operate in the open as much as possible; this provides strong pressure to do things right.
  • Minimize the number of moving parts: technical handoffs are difficult in the best of circumstances. The fewer manual steps necessary to onboard a new developer, the better the overall transition will be.

What about blockchain?

A long-term goal of this project is to facilitate not only search and discovery, but also reuse and “gratitude.” A frequent complaint about open licenses in general — both for creative works and software code — is that contributing to the commons can be a thankless task. There are always more consumers than contributors, and there’s no open web equivalent to a Facebook “like.”

The nature of the reuse cycle can suggest a blockchain architecture: if I create photo A, and you modify it as photo B, there is an ordered relationship between the two that it would be nice to record and surface. While most acts of reuse are non-commercial, that doesn’t have to be the case, and a blockchain-based system could also serve to record a licensing transaction in a distributed way.

Having said that, we opted not to pursue a blockchain-based approach for the first iteration, for a few reasons:

  • I have no experience with the technology personally, so getting up to speed (especially with tech that’s evolving fast) would be challenging in our seven-month timeframe.
  • It violates the “obvious over obscure” guideline — not only would I be signing up to learn an experimental tech, we’d be obligating the future team to live with that constraint.
  • The first iteration of CC Search is focused on the search and discovery end, rather than the use cases around derivative works, so the long-term benefits of the blockchain would not be evident in the first release.

Languages and frameworks

CC’s existing web services are in a diversity of languages — Ruby, Python, and PHP — and any integrations were going to be on the web service layer, so I chose to work in Python solely due to my own familiarity with it. As the prototype evolved, we decided the opportunity for an engaging front door to the Commons lay in curation and personalization. Because of its dedicated maintenance team and frequent patch management, I chose Django as the web framework.

CC Search is, of course, largely about search. I chose Elasticsearch over Solr or other search engine options primarily due to the availability of AWS’s Elasticsearch-as-a-service. CC Search is not, at this time, a particularly sophisticated search application; image metadata is relatively simple and when dealing with a heterogeneous content set from a diversity of providers, one tends towards a lowest-common-denominator approach — our search can only be as rich as our weakest data source. There is much to be improved here.
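
The lowest-common-denominator idea can be sketched as a mapping from each provider's own field names onto the few fields every source can supply. The two payload shapes below are invented for illustration and do not correspond to any real provider API; `FIELD_MAPS` and `normalize` are likewise hypothetical names.

```python
# The only fields the search index relies on, because every source has them.
COMMON_FIELDS = ("title", "creator", "tags")

# Hypothetical per-source field names; real provider payloads will differ.
FIELD_MAPS = {
    "photo_site": {"title": "title", "creator": "owner_name", "tags": "tags"},
    "museum": {"title": "object_title", "creator": "artist", "tags": "keywords"},
}


def normalize(source, payload):
    """Reduce a provider-specific payload to the common searchable fields."""
    mapping = FIELD_MAPS[source]
    doc = {"provider": source}
    for common in COMMON_FIELDS:
        value = payload.get(mapping[common])
        if common == "tags":
            # Accept either a space-separated string or a list of tags.
            if isinstance(value, str):
                value = value.split()
            value = [tag.lower() for tag in (value or [])]
        doc[common] = value
    return doc


if __name__ == "__main__":
    photo = {"title": "Sunset", "owner_name": "A. Photographer",
             "tags": "Sky Beach", "exif": {"iso": 100}}
    museum = {"object_title": "Vase", "artist": None,
              "keywords": ["Ceramics"], "accession": "1901.5"}
    print(normalize("photo_site", photo))
    print(normalize("museum", museum))
```

Anything a single provider offers beyond the common fields (the `exif` and `accession` values here) is dropped, which is exactly the richness that is lost when search can only be as good as the weakest data source.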

Lastly, I chose Postgres (via AWS’s RDS database service) over MySQL due to its support for JSON and Array datatypes. MongoDB and other NoSQL solutions are attractive due to their fluidity and scalability, but were never a strong contender — I felt we already had many of the benefits of a document database via Elasticsearch, and regardless, Django loses much of its value without a relational database backend. With the ability to store and query arbitrary JSON, I felt Postgres struck a nice balance between structure and flexibility.

CC Search does have front-end components, but they are relatively limited. Because CC Search is targeting recent browsers, I decided to forgo any particular framework, even jQuery, though I did use ES6 syntax for its conciseness and clarity. This requires some tooling and build steps that may violate my “obvious over obscure” mandate, but I would argue that ES6 is a better foundation for future JS development than the alternatives. I also set up JS-level unit tests, though they are less comprehensive than the Python suite.

Privacy and harm mitigation

All product developers need to be asking themselves what they are doing to minimize privacy issues, as well as looking hard at how their applications could be misused. I had excellent support and feedback from the CC team on the subject of collecting the minimum amount of user data: enough to actually operate a site that includes personalization, but no more. It’s a joy to work with an organization that cares deeply about user safety and privacy. We may not have gotten it perfectly right, but reviewing the final product to identify potential privacy and abuse vectors was a no-brainer.

Future work

There was much we wanted to do that we’ve deferred for now, or will revisit based on user feedback:

  • Including more content and more content partners: the full breadth of the Europeana collection, a selected subset from DPLA, and a larger subset of the Flickr Commons would be obvious directions
  • More tools to allow users to customize their shared Lists
  • Allowing users to search from among their own curated material
  • Providing mechanisms to allow trusted users to push metadata back into the collection, enriching the archive for everyone
  • More engaging search tools — search by color, drill down into tags, search public Lists

We want to hear your feedback and suggestions! Please send ideas and comments through our Feedback form, and report bugs and issues to our public issue tracker.


Published by HackerNoon on 2017/02/07