Documents

The ads.Document object

The ads.Document object is used to represent a record in ADS. We use this data model and an object-relational mapper (ORM) to search and filter for documents on ADS in a programmatic way. For example, you can write queries like

expression = (Document.year < 2020) & Document.title.like("JWST")

and have them translated into a (Solr) query that you would input through the ADS web interface.
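To make the idea concrete, here is a minimal, self-contained sketch of how overloaded Python operators can build up a query string. The Field and Expression classes, and the exact text they produce, are illustrative assumptions; they are not the internals of the ads package.

```python
class Expression:
    """A fragment of query text that can be combined with & and |."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Expression(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return Expression(f"({self.text} OR {other.text})")

    def __str__(self):
        return self.text


class Field:
    """A named, searchable field (hypothetical stand-in for Document.year etc.)."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        return Expression(f"{self.name}:{value}")

    def __lt__(self, value):
        # Assumes an integer field, so "< value" becomes a range ending at value - 1.
        return Expression(f"{self.name}:[* TO {value - 1}]")

    def like(self, value):
        return Expression(f'{self.name}:"{value}"')


year = Field("year")
title = Field("title")

expression = (year < 2020) & title.like("JWST")
print(expression)  # (year:[* TO 2019] AND title:"JWST")
```

The comparison and bitwise operators return new Expression objects instead of booleans, which is what lets an ordinary-looking Python expression be turned into query text later.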

Fields and operators

The ADS search engine has some fields and operators that are searchable but not viewable. That means you can search for documents using such a field, but when you retrieve a document the value of that field will be None.

The ads.Document data model reflects this functionality as closely as possible by using:

  • typed fields for searchable and viewable Solr fields

  • virtual fields for those that are searchable but not viewable, and

  • functions for Solr operators.

Tip

All typed fields, virtual fields, and functions can be accessed as attributes of the ads.Document object.

That means if you’re using an interactive Python environment (e.g., a Python shell, an IPython shell, or Jupyter) then you can access all of these by typing ads.Document and using <tab> completion.

The following fields can be searched and viewed:

ads.Document.abstract

The document abstract.

ads.Document.aff

The raw, provided affiliation field.

ads.Document.aff_id

Curated affiliation identifier, parsed from the given affiliation string.

ads.Document.alternate_bibcode

An alternate bibcode.

ads.Document.alternate_title

Alternate title, usually present when the original title is not in English.

ads.Document.arxiv_class

The arXiv class a document was submitted to.

ads.Document.author

Author name.

ads.Document.author_count

The number of authors.

ads.Document.author_norm

Author name in the form 'Lastname, F'.

ads.Document.bibcode

Document bibliographic code.

ads.Document.bibgroup

Records by bibliographic groups, curated by staff outside of ADS.

ads.Document.bibstem

Abbreviated name of the journal or publication (e.g., ApJ).

ads.Document.citation_count

The number of citations to this document.

ads.Document.citation_count_norm

Normalised citation count.

ads.Document.cite_read_boost

The normalized boost factor.

ads.Document.classic_factor

A classic prestige score, designed to more highly rank papers that are relevant and popular now.

ads.Document.data

Related data sources.

ads.Document.database

The database the document resides in (e.g., astronomy or physics).

ads.Document.date

Publication date, represented by a time format and used for indexing.

ads.Document.doctype

Document type (e.g., article, thesis).

ads.Document.doi

Digital object identifier.

ads.Document.eid

Electronic identifier of the paper, which is the equivalent of a page number.

ads.Document.email

Email addresses of the authors.

ads.Document.entry_date

Creation date of the ADS record.

ads.Document.esources

Types of electronic sources available for a record (e.g., PUB_HTML, EPRINT_PDF).

ads.Document.facility

Facilities declared in a record, based on a controlled list by AAS journals.

ads.Document.grant

Search by grant identifiers and grant agencies (ads.Document.grant_id and ads.Document.grant_agencies).

ads.Document.grant_agencies

A field with just the grant agency’s name (e.g., NASA).

ads.Document.grant_id

Search by grant identifier.

ads.Document.id

A unique identifier for the document, curated by ADS.

ads.Document.identifier

Search by an array of alternative identifiers for a record.

ads.Document.indexstamp

Datetime when the document was last indexed by the ADS Solr service.

ads.Document.isbn

International Standard Book Number.

ads.Document.issn

International Standard Serial Number.

ads.Document.issue

Issue number of the journal that includes the article.

ads.Document.keyword

An array of normalized and non-normalized keyword values associated with the record.

ads.Document.lang

The language of the main title.

ads.Document.links_data

Information on what linked documents are available.

ads.Document.nedid

List of NED IDs for a record.

ads.Document.nedtype

Keywords used to describe the NED type (e.g., galaxy, star).

ads.Document.orcid_other

ORCID claims from users who used the ADS claiming interface.

ads.Document.orcid_pub

ORCIDs supplied by publishers.

ads.Document.orcid_user

ORCID claims from users who gave ADS consent to expose their public profile.

ads.Document.page

Page number of a record.

ads.Document.page_count

The difference between the first and last page numbers in ads.Document.page_range.

ads.Document.property

An array of miscellaneous flags associated with a record.

ads.Document.pub

The canonical name of the publication that the record appeared in.

ads.Document.pub_raw

Name of the publisher, but also includes the volume, page, and issue if they exist.

ads.Document.pubdate

Publication date in the form YYYY-MM-DD, where DD will always be '00'.

ads.Document.score

The closeness of the document to the query match.

ads.Document.simbid

List of SIMBAD IDs within a document.

ads.Document.simbtype

Keywords used to describe the SIMBAD type.

ads.Document.read_count

The number of times the record has been viewed within a 90 day window.

ads.Document.title

The title of the record.

ads.Document.vizier

Keywords, 'subject' tags from Vizier.

ads.Document.volume

The journal volume.

ads.Document.year

Year of publication.

Important

Just because a field is viewable doesn’t mean you will always get a value. For example, if a record has no email address information about the authors, then doc.email will return something like ['-', '-'] for a two-author record. In other situations the value returned might just be None.
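Because a missing value can show up either as None or as a placeholder string, it can help to normalise values before using them. The helper below is a hypothetical sketch: clean_values is not part of the ads package, and it assumes '-' is the only placeholder, as in the example above.

```python
def clean_values(values, placeholder="-"):
    """Return a list in which missing entries are represented by None.

    If the whole field is missing (None), an empty list is returned.
    """
    if values is None:
        return []
    return [None if value == placeholder else value for value in values]


print(clean_values(["-", "-"]))               # [None, None]
print(clean_values(["a@example.org", "-"]))   # ['a@example.org', None]
print(clean_values(None))                     # []
```

Keeping one entry per author (rather than dropping the placeholders) preserves the alignment between positions in this list and positions in the author list.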

 

These virtual fields can be searched, but not viewed:

 

These functions access Solr operators to search for documents, but don’t return a value:

ads.Document.citations

Return documents that cite documents returned by the given expression.

ads.Document.citis

A different implementation of ads.Document.citations that uses less memory, but is slower.

ads.Document.classic_relevance

A toy implementation of the ADS Classic relevance score algorithm.

ads.Document.instructive

A synonym of ads.Document.reviews.

ads.Document.join_citations

This operator is the equivalent of ads.Document.citations but using a Lucene block-join.

ads.Document.join_references

This operator is the equivalent of ads.Document.references but using a Lucene block-join.

ads.Document.pos

Return documents where the given expression is matched only by an item in a given position, or range of positions (e.g., author or affiliation position).

ads.Document.references

Return documents that are cited by documents in the given expression.

ads.Document.reviews

The reviews operator takes the list of articles which cited the papers in the given expression, combines them, and returns this list sorted by how frequently a citing paper appears in the combined list.

ads.Document.similar

Return documents that have abstracts similar in wording to documents returned by the given expression.

ads.Document.top_n

Return the top n documents matching the given expression.

ads.Document.trending

Query documents that are most read by users who read recent papers on the topic being researched.

ads.Document.useful

Return documents frequently cited by the most relevant papers returned by the given expression.

The ads.Document.useful2 and ads.Document.reviews2 operators are also available, but are not listed here to avoid confusion.

Search for documents

The ads.Document object has an ads.Document.select() function to select (search for) documents. If you’ve used a SQL database before, you’ll notice this approach is very similar to how you would select records from a SQL database.

When you call ads.Document.select it returns a ModelSelect object. This object has everything it needs to perform the ADS search, but it won’t actually do anything until you try to access documents (e.g., by iterating over the ModelSelect object). For example:

from ads import Document

# The .where() and .limit() functions will be explained below. 
docs = (
    Document.select()
            .where(
                Document.author == "Ness, M"
            )
            .limit(3)
)

print(f"# {type(docs)}")
# <class 'ads.ModelSelect'>

# *Nothing* has been sent to ADS yet.
# The request is only executed when we try to access the results:
for doc in docs: # Now the request is made to ADS
    print(f"# {doc}")
# <Document: bibcode=2017AJ....154...28B>
# <Document: bibcode=2018ApJS..235...42A>
# <Document: bibcode=2017ApJS..233...25A>

We can access (or iterate over) the documents in docs as many times as we want, but the query is only sent to ADS once.
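The general pattern (defer the request until the results are needed, then cache them) can be sketched in a few lines. LazyResults and fake_request are hypothetical names used only for illustration; this is not how ModelSelect is actually implemented.

```python
class LazyResults:
    """Defer an expensive fetch until iteration, then reuse the cached result."""

    def __init__(self, fetch):
        self._fetch = fetch   # zero-argument callable that performs the request
        self._cache = None    # filled in on first access

    def __iter__(self):
        if self._cache is None:
            self._cache = list(self._fetch())
        return iter(self._cache)


calls = []


def fake_request():
    """Stand-in for a network request; records each time it is called."""
    calls.append("request")
    return ["doc-1", "doc-2", "doc-3"]


docs = LazyResults(fake_request)
print(len(calls))     # 0: nothing has been requested yet

first = list(docs)    # the request happens here
second = list(docs)   # served from the cache
print(len(calls))     # 1
```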

Select fields

If you only want specific fields returned by ADS, you can explicitly give these to ads.Document.select(). If you want all the fields of a document then don’t supply anything to .select(). For example:

from ads import Document

# Give me everything (this is the usual case).
everything = Document.select()

# I only want bibcodes, author counts, and citation counts.
something = Document.select(
    Document.bibcode, Document.author_count, Document.citation_count
)

If you made the second query above (something) and then later decided you also needed the title of a document (or some other field), don’t worry: the data attributes on ads.Document are special and will lazily retrieve the missing data for you. This is considered bad practice because it increases the number of API calls you use, and for this reason you will see a (single) warning.

Note that even if we select only a single field, when we iterate over ads.Document.select() we will always get an ads.Document object. Below is an example: the Python code is shown first, followed by its output.

from ads import Document

# Give me citation counts for these documents
docs = (
    Document.select(
                Document.citation_count,
            ).where(
                Document.bibcode.in_(["1996PhRvL..77.3865P", "1996PhRvB..5411169K"])
            )
)

for doc in docs:
    # Oops, I want their titles too.
    print(f"{doc.bibcode}: {doc.title[0]} ({doc.citation_count:,} citations)")

LazyAttributesWarning: You're lazily loading document attributes, which makes many calls to the API. This will impact your rate limits. If you know what document fields you want ahead of time, provide them as arguments to `Document.select()`.

1996PhRvL..77.3865P: Generalized Gradient Approximation Made Simple (61,563 citations)
1996PhRvB..5411169K: Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set (34,986 citations)

The ads.Document.bibcode and ads.Document.id fields will always be requested by ads.Document.select, even if you didn’t ask for them, because these fields are required to make future queries against ADS, and to evaluate whether two documents are different.

Important

If you’re only lazily loading a few fields, or a single field for a few documents, then that’s OK. But you should know that lazy loading a single field in a single document requires an API call. If you are lazily loading 5 fields in 10 documents, that’s 50 API calls. By contrast, a single API call to the ADS search service can return 200 documents with all fields.

Lazy loading is a feature to help you code interactively without having to repeat complex searches, but it is not intended to be used often.
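The arithmetic above can be written down as a quick back-of-the-envelope check. Both functions are hypothetical helpers, and the figure of 200 documents per call is simply the number quoted above.

```python
import math


def lazy_calls(n_documents, n_missing_fields):
    """One API call per missing field per document."""
    return n_documents * n_missing_fields


def search_calls(n_documents, rows_per_call=200):
    """Calls needed when every field is requested up front."""
    return math.ceil(n_documents / rows_per_call)


print(lazy_calls(10, 5))   # 50
print(search_calls(10))    # 1
```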

Filtering

By default, ads.Document.select() on its own would return an iterable that would (eventually) retrieve every document from ADS. That’s probably not what you want; you should provide some filter to return particular documents.

The way we filter for documents is by using the .where() function with ads.Document.select(). Here is an example:

from ads import Document

docs = (
    Document.select()
            .where(Document.year == 2015)
)

We can combine multiple filters using the standard & (meaning ‘and’) and | (meaning ‘or’) operators in Python. For example, let’s look for documents authored by “Lastname, First” in 2015 or 2018-2019:

from ads import Document

docs = (
    Document.select()
            .where(
                (Document.author == "Lastname, First")
            &   (
                    Document.year.between(2018, 2019)
                |   (Document.year == 2015)
                )
            )
)

You can include any searchable field in the expression you give to .where(). You can also include operators, which are accessible as functions:

from ads import Document

docs = (
    Document.select()
            .where(
                Document.trending(Document.title == "exoplanets")
            &   Document.author_count.between(1, 5)
            )
)

Important

Although you may be tempted to use Python’s in, and, or, and not operators in your query expressions, these will not work. The return value of an in expression is always coerced to a boolean value by Python. Similarly, and, or, and not all treat their arguments as boolean values and cannot be overloaded. So as a guide:

  • Use | instead of or

  • Use & instead of and

  • Use ~ instead of not

  • Use .in_() and .not_in() instead of in and not in
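You can see this behaviour with a toy class. Expr below is a hypothetical stand-in for a query expression (it is not the class used by the ads package); it overloads & and tries to overload in:

```python
class Expr:
    """Toy query expression used only to illustrate Python's rules."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Expr(f"({self.text} AND {other.text})")

    def __contains__(self, item):
        # Whatever is returned here, Python converts it to a bool.
        return Expr(f"{self.text} contains {item}")


a = Expr("year:2015")
b = Expr('author:"Lastname, First"')

combined = a & b    # __and__ is called, so both sides are kept
dropped = a and b   # no hook is called: Python just returns b

print(combined.text)   # (year:2015 AND author:"Lastname, First")
print(dropped is b)    # True
print("x" in a)        # True
```

With and, the left-hand expression silently disappears from the query, which is why the operator versions are the only safe choice.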

You should always wrap each expression in brackets, too. This is because Python’s & and | operators bind more tightly than comparison operators such as == and <. That means:

  • Good: (Document.year == 2000) & (Document.title.like("JWST"))

  • Bad: Document.year == 2000 & Document.title.like("JWST")
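If you want to check how Python groups an expression, the standard library ast module will show you. This sketch only parses the two expressions as text (the names year and title are placeholders and are never evaluated):

```python
import ast

bad = ast.parse('year == 2000 & title.like("JWST")', mode="eval").body
good = ast.parse('(year == 2000) & (title.like("JWST"))', mode="eval").body

# Without brackets the top-level node is a comparison whose right-hand side
# is the `&` operation, i.e. year == (2000 & title.like("JWST")).
print(type(bad).__name__, type(bad.comparators[0]).__name__)   # Compare BinOp

# With brackets the top-level node is the `&` operation itself.
print(type(good).__name__, type(good.op).__name__)             # BinOp BitAnd
```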

Ordering

The default ordering for documents returned by ADS is by relevance (computed by the ADS search engine). Instead, you can sort by any of the following fields:

ads.Document.id

A unique identifier for the document, curated by ADS.

ads.Document.author_count

The number of authors.

ads.Document.bibcode

Document bibliographic code.

ads.Document.citation_count

The number of citations to this document.

ads.Document.citation_count_norm

Normalised citation count.

ads.Document.classic_factor

A classic prestige score, designed to more highly rank papers that are relevant and popular now.

ads.Document.date

Publication date, represented by a time format and used for indexing.

ads.Document.entry_date

Creation date of the ADS record.

ads.Document.read_count

The number of times the record has been viewed within a 90 day window.

ads.Document.score

The closeness of the document to the query match.

Use the .order_by() function to order results:

from ads import Document

# Get the freshest bangerz
docs = (
    Document.select()
            .order_by(
                Document.entry_date.desc()
            )
)

# Sort by multiple ordered fields
docs = (
    Document.select()
            .order_by(
                Document.author_count.desc(),
                Document.citation_count.asc()
            )
)
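When you sort by multiple fields, the first field is the primary sort key and later fields only break ties. As a rough local analogy (the records below are made-up placeholders, not real ADS data), the second query above orders documents the same way as this plain-Python sort:

```python
records = [
    {"bibcode": "A", "author_count": 3, "citation_count": 10},
    {"bibcode": "B", "author_count": 5, "citation_count": 7},
    {"bibcode": "C", "author_count": 5, "citation_count": 2},
]

# author_count descending, then citation_count ascending.
ordered = sorted(
    records,
    key=lambda record: (-record["author_count"], record["citation_count"]),
)
print([record["bibcode"] for record in ordered])   # ['C', 'B', 'A']
```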

Limits

It’s good practice to set a limit for the number of documents you want. This reduces the load on the ADS search service, and it helps the ads package to paginate queries ahead of time.

Most of the search examples we have shown so far don’t have any limits applied. That means they will return every document found by ADS that matches the expression.

We can limit the number of documents retrieved by applying a .limit() to our .select() call:

from ads import Document

docs = (
    Document.select()
            .where(
                (Document.year > 2020)
            )
            .order_by(
                Document.citation_count.desc()
            )
            .limit(10)
)

# Or if we have no filters or sort:
docs = Document.select().limit(3)
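To see why a limit helps with pagination, here is a minimal sketch of how a known limit can be split into page requests ahead of time. plan_pages is a hypothetical helper, and the page size of 200 is an assumption; start and rows are the usual Solr paging parameters.

```python
def plan_pages(limit, rows_per_page=200):
    """Return (start, rows) pairs that together cover `limit` documents."""
    pages = []
    start = 0
    while start < limit:
        rows = min(rows_per_page, limit - start)
        pages.append((start, rows))
        start += rows
    return pages


print(plan_pages(10))    # [(0, 10)]
print(plan_pages(450))   # [(0, 200), (200, 200), (400, 50)]
```

Without a limit the total number of pages is not known until the first response arrives, so the requests cannot be planned in advance like this.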

Advanced document searches

You can search for documents using expressions that include other data model objects. For example:

Use explicit Solr queries

If all of this search syntax with Document.select() and .where() and .order_by() and .limit() is too frightening or confusing, you can always give an explicit search query to ADS. The way to do this is the same as how you performed searches in previous versions of the ads package, in order to remain as backwards-compatible as possible.

Here’s how you would give an explicit search query:

from ads import Document, SearchQuery

docs = SearchQuery(
    q='author:"Ness, M" AND year:2018',
    fl=["bibcode", "title", "author", "citation_count"]
)

# SearchQuery will return a `ModelSelect` object,
# which you can use the same way as the result of `Document.select()`
for doc in docs:
    print(doc)

# Here's the same query in the "new" way, except we retrieve all fields
docs = (
    Document.select()
            .where(
                (Document.author == "Ness, M")
            &   (Document.year == 2018)
            )
)

If you’re upgrading from ads 0.12.3 then your existing code might still work, but you’ll notice some name changes. For example, when you iterate over the result of calling ads.SearchQuery you will get ads.Document objects, not ads.Article objects. If you have trouble replicating your existing queries in the new search format, please create an issue on GitHub.

BigQuery

Most ADS searches use the /search/query API endpoint. However, if the search requires checking for a large number of explicit bibcodes, this can be expensive for the ADS search service. In these situations you should probably use the /search/bigquery API endpoint. The ads package automatically evaluates the expression given to ads.Document.select().where() and decides whether it should use the standard search endpoint, or the BigQuery endpoint. As a user, you don’t have to explicitly set this.

The only reason why you need to know about this is that the BigQuery API endpoint has a lower rate limit than the standard search API endpoint, which means you can’t make as many BigQuery API calls as you can make standard search calls. So if you run into an error where you have hit your API call limit for the day, but you don’t think you have made that many queries, it might be because you have hit the BigQuery API limit, but not the standard search limit.
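The sketch below shows how the two limits are counted separately. The limit values are hypothetical placeholders (check your own ADS account for the real ones), and remaining_after_call is not part of the ads package.

```python
from collections import Counter

# Hypothetical daily limits, used only for illustration.
LIMITS = {"/search/query": 5000, "/search/bigquery": 100}
used = Counter()


def remaining_after_call(endpoint):
    """Record one call and return how many calls are left for that endpoint."""
    used[endpoint] += 1
    return LIMITS[endpoint] - used[endpoint]


for _ in range(20):
    standard_left = remaining_after_call("/search/query")
for _ in range(100):
    bigquery_left = remaining_after_call("/search/bigquery")

print(sum(used.values()))   # 120
print(standard_left)        # 4980
print(bigquery_left)        # 0
```

Only 120 calls were made in total, yet the BigQuery allowance is already used up while the standard search allowance is barely touched.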