Documents¶
The ads.Document
object¶
The ads.Document
object is used to represent a record in ADS. We use this data model and an object-relational mapper (ORM) to search and filter for documents on ADS in a programmatic way. For example, you can write queries like
expression = (Document.year < 2020) & Document.title.like("JWST")
and have them translated into a (Solr) query that you would input through the ADS web interface.
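To make the translation step concrete, here is a minimal, self-contained sketch of how operator overloading can turn a Python expression into a Solr-style query string. The `Field` and `Expression` classes are hypothetical stand-ins written for this illustration. They are not the classes used by the `ads` package, and the exact query text that package produces may differ.

```python
class Expression:
    """A fragment of a Solr-style query string (hypothetical helper)."""

    def __init__(self, query):
        self.query = query

    def __and__(self, other):
        return Expression(f"({self.query} AND {other.query})")

    def __or__(self, other):
        return Expression(f"({self.query} OR {other.query})")

    def __invert__(self):
        return Expression(f"(NOT {self.query})")

    def __str__(self):
        return self.query


class Field:
    """A named, searchable field (hypothetical stand-in for a Document field)."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        return Expression(f'{self.name}:"{value}"')

    def __lt__(self, value):
        # Solr range syntax; the closing curly brace excludes `value` itself.
        return Expression(f"{self.name}:[* TO {value}}}")

    def like(self, value):
        return Expression(f"{self.name}:{value}")


year = Field("year")
title = Field("title")

expression = (year < 2020) & title.like("JWST")
print(expression)  # (year:[* TO 2020} AND title:JWST)
```

Because each comparison returns an `Expression` rather than a boolean, the pieces can be combined with `&`, `|`, and `~` before anything is sent to a server.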
Fields and operators¶
The ADS search engine has some fields and operators that are searchable but not viewable. That means you can search for documents using these fields, but when you retrieve a document the value of such a field will be None
.
The ads.Document
data model reflects this functionality as closely as possible by using:
typed fields for searchable and viewable Solr fields
for example,
ads.Document.id
is an IntegerField
virtual fields for those that are searchable but not viewable, and
for example,
ads.Document.ack
is a VirtualField
functions for Solr operators.
for example,
ads.Document.similar()
.
Tip
All typed fields, virtual fields, and functions can be accessed as attributes of the ads.Document
object.
That means if you’re using an interactive Python environment (e.g., a Python shell, an IPython shell, or Jupyter) then you can access all of these by typing ads.Document
and using <tab>
completion.
The following fields can be searched and viewed:
The document abstract. |
|
The raw, provided affiliation field. |
|
Curated affiliation identifier, parsed from the given affiliation string. |
|
An alternate bibcode. |
|
Alternate title, usually present when the original title is not in English. |
|
The arXiv class a document was submitted to. |
|
Author name. |
|
The number of authors. |
|
Author name in the form 'Lastname, F'. |
|
Document bibliographic code. |
|
Records by bibliographic groups, curated by staff outside of ADS. |
|
Abbreviated name of the journal or publication (e.g., ApJ). |
|
The number of citations to this document. |
|
Normalised citation count. |
|
The normalized boost factor. |
|
A classic prestige score, designed to more highly rank papers that are relevant and popular now. |
|
Related data sources. |
|
The database the document resides in (e.g., astronomy or physics). |
|
Publication date, represented by a time format and used for indexing. |
|
Document type (e.g., article, thesis, etc.). |
|
Digital object identifier |
|
Electronic identifier of the paper, which is the equivalent of a page number. |
|
Email addresses of the authors. |
|
Creation date of the ADS record. |
|
Types of electronic sources available for a record (e.g., |
|
Facilities declared in a record, based on a controlled list by AAS journals. |
|
Search by grant identifiers and grant agencies ( |
|
A field with just the grant agencies name (e.g., NASA). |
|
Search by grant identifier. |
|
A unique identifier for the document, curated by ADS. |
|
Search by an array of alternative identifiers for a record. |
|
Datetime when the document was last indexed by the ADS Solr service. |
|
International Standard Book Number |
|
International Standard Serial Number |
|
Issue number of the journal that includes the article. |
|
An array of normalized and non-normalized keyword values associated with the record. |
|
The language of the main title. |
|
Information on what linked documents are available. |
|
List of NED IDs for a record. |
|
Keywords used to describe the NED type (e.g., galaxy, star). |
|
ORCID claims from users who used the ADS claiming interface. |
|
ORCIDs supplied by publishers. |
|
ORCID claims from users who gave ADS consent to expose their public profile. |
|
Page number of a record. |
|
The difference between the first and last page numbers in |
|
An array of miscellaneous flags associated with a record. |
|
The canonical name of the publication that the record appeared in. |
|
Name of the publisher, but also includes the volume, page, and issue if exists. |
|
Publication date in the form YYYY-MM-DD, where DD will always be '00'. |
|
The closeness of the document to the query match. |
|
List of SIMBAD IDs within a document. |
|
Keywords used to describe the SIMBAD type. |
|
The number of times the record has been viewed within a 90 day window. |
|
The title of the record. |
|
Keywords, 'subject' tags from Vizier. |
|
The journal volume. |
|
Year of publication. |
Important
Just because a field is viewable doesn’t mean you will always get a value. For example, if a record has no email address information about the authors, then doc.email
will return something like ['-', '-']
for a two-author record. In other situations the value returned might just be None
.
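Because a viewable field can come back as `None` or as a list of placeholder strings, it can help to normalise values before using them. The helper below is a minimal sketch: the name `clean_values` is made up for this example, and it assumes `'-'` is the only placeholder you need to strip, which is what the example above suggests but is not guaranteed for every field.

```python
def clean_values(values, placeholder="-"):
    """Return a list with `None` and placeholder entries removed.

    `values` may itself be None, which is treated as "no values".
    """
    if values is None:
        return []
    return [v for v in values if v is not None and v != placeholder]


print(clean_values(["-", "-"]))              # []
print(clean_values(["a@example.org", "-"]))  # ['a@example.org']
print(clean_values(None))                    # []
```

In practice you would pass in something like `doc.email` and check whether the result is empty before using it.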
These virtual fields can be searched, but not viewed:
Search for a word or phrase in |
|
Search by |
|
Search by arXiv identifier. |
|
Search by |
|
Search by ORCID identifier, from all possible sources: |
These functions access Solr operators to search for documents, but don’t return a value:
Return documents that cite documents returned by the given expression. |
|
A different implementation of |
|
A toy implementation of the ADS Classic relevance score algorithm. |
|
Return documents that are the synonym of |
|
This operator is the equivalent of |
|
This operator is the equivalent of |
|
Return documents where the given expression is matched only by an item in a given position, or range of positions (e.g., author or affiliation position). |
|
Return documents that are cited by documents in the given expression. |
|
The reviews operator takes the list of articles which cited the papers in the given expression, combines them, and returns this list sorted by how frequently a citing paper appears in the combined list. |
|
Return documents that have abstracts similar in wording to documents returned by the given expression. |
|
Return the top n documents matching the given expression. |
|
Query documents that are most read by users who read recent papers on the topic being researched. |
|
Return documents frequently cited by the most relevant papers returned by the given expression. |
The ads.Document.useful2
and ads.Document.reviews2
operators are also available, but are not listed here to avoid confusion.
Search for documents¶
The ads.Document
object has an ads.Document.select()
function to select (search for) documents. If you’ve used a SQL database before, you’ll notice this approach is very similar to how you would select records from a SQL database.
When you call ads.Document.select
it returns a ModelSelect
object. This object has everything it needs to perform the ADS search, but it won’t actually do anything until you try to access documents (e.g., by iterating over the ModelSelect
object). For example:
from ads import Document
# The .where() and .limit() functions will be explained below.
docs = (
Document.select()
.where(
Document.author == "Ness, M"
)
.limit(3)
)
print(f"# {type(docs)}")
# <class 'ads.ModelSelect'>
# *Nothing* has been sent to ADS yet.
# The request is only executed when we try to access the results:
for doc in docs:  # Now the request is made to ADS
    print(f"# {doc}")
# <Document: bibcode=2017AJ....154...28B>
# <Document: bibcode=2018ApJS..235...42A>
# <Document: bibcode=2017ApJS..233...25A>
We can access (or iterate over) the documents in docs
as many times as we want, but the query is only sent to ADS once.
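The "build now, execute on first access, then reuse" behaviour can be sketched in a few lines. `LazyResults` and `fake_search` are hypothetical names invented for this illustration. The sketch shows the general pattern only and makes no claim about how `ModelSelect` is implemented internally.

```python
class LazyResults:
    """Run `fetch` on first iteration and cache what it returns."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            # First access: this is where a real client would contact the server.
            self._cache = list(self._fetch())
        return iter(self._cache)


calls = []


def fake_search():
    """Stand-in for a network request; records each time it is called."""
    calls.append("request")
    return ["doc-1", "doc-2", "doc-3"]


docs = LazyResults(fake_search)
print(len(calls))  # 0 (nothing has been requested yet)

first_pass = list(docs)
second_pass = list(docs)
print(len(calls))  # 1 (the second pass reused the cached results)
```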
Select fields¶
If you only want specific fields returned by ADS, you can explicitly give these to ads.Document.select()
. If you want all the fields of a document then don’t supply anything to .select()
. For example:
from ads import Document
# Give me everything (this is the usual case).
everything = Document.select()
# I only want bibcodes, author counts, and citation counts.
something = Document.select(
Document.bibcode, Document.author_count, Document.citation_count
)
If you made the second query above (something
) and then later decided you also needed the title of a document (or some other field), don’t worry: the data attributes on ads.Document
are special and will lazily retrieve the missing data for you. Relying on this is considered bad practice because it increases the number of API calls you use, so you will see a (single) warning when it happens.
Note that even if we select only a single field, when we iterate over ads.Document.select()
we will always get an ads.Document
object. Below is an example. The Python code is shown first, followed by the output it produces.
from ads import Document
# Give me citation counts for these documents
docs = (
Document.select(
Document.citation_count,
).where(
Document.bibcode.in_(["1996PhRvL..77.3865P", "1996PhRvB..5411169K"])
)
)
for doc in docs:
    # Oh shit, I want their titles too.
    print(f"{doc.bibcode}: {doc.title[0]} ({doc.citation_count:,} citations)")
LazyAttributesWarning: You're lazily loading document attributes, which makes many calls to the API. This will impact your rate limits. If you know what document fields you want ahead of time, provide them as arguments to `Document.select()`.
1996PhRvL..77.3865P: Generalized Gradient Approximation Made Simple (61,563 citations)
1996PhRvB..5411169K: Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set (34,986 citations)
The ads.Document.bibcode
and ads.Document.id
will always be requested by ads.Document.select
, even if you didn’t ask for them, because these fields are required to make future queries against ADS, and to evaluate whether two documents are different.
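As a rough illustration of "always requested", the sketch below merges whatever fields were asked for with the two that are always needed. The function name `build_field_list` and the constant `REQUIRED_FIELDS` are assumptions made for this example. The `ads` package may assemble its field list differently.

```python
# Fields that are always needed, per the paragraph above.
REQUIRED_FIELDS = ("bibcode", "id")


def build_field_list(requested):
    """Return the requested field names plus the required ones, without duplicates."""
    fields = list(dict.fromkeys(requested))  # de-duplicate while keeping order
    for name in REQUIRED_FIELDS:
        if name not in fields:
            fields.append(name)
    return fields


print(build_field_list(["citation_count"]))
# ['citation_count', 'bibcode', 'id']
print(build_field_list(["bibcode", "title"]))
# ['bibcode', 'title', 'id']
```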
Important
If you’re only lazily loading a few fields, or a single field for a few documents, then that’s OK. But you should know that lazy loading a single field in a single document requires an API call. If you are lazily loading 5 fields in 10 documents, that’s 50 API calls. For contrast, a single API call to the ADS search service can return 200 documents with all fields.
Lazy loading is a feature to help you code interactively without having to repeat complex searches, but it is not intended to be used often.
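The arithmetic above is easy to check. This sketch uses two made-up helper functions and takes the figures from the note as given: one API call per lazily loaded field per document, and up to 200 documents per search call.

```python
import math


def lazy_load_calls(n_fields, n_documents):
    """API calls needed if every field is lazily loaded for every document."""
    return n_fields * n_documents


def search_calls(n_documents, rows_per_call=200):
    """API calls needed if all fields are requested up front in the search."""
    return math.ceil(n_documents / rows_per_call)


print(lazy_load_calls(5, 10))  # 50
print(search_calls(10))        # 1
print(search_calls(1000))      # 5
```

Requesting the fields in `Document.select()` keeps the cost proportional to the number of result pages rather than to fields times documents.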
Filtering¶
By default, ads.Document.select()
on its own would return an iterable that would (eventually) retrieve every document from ADS. That’s probably not what you want; you should provide some filter to return particular documents.
The way we filter for documents is by using the .where()
function with ads.Document.select()
. Here is an example:
from ads import Document
docs = (
Document.select()
.where(Document.year == 2015)
)
We can combine multiple filters using the standard &
(meaning ‘and’) and |
(meaning ‘or’) operators in Python. For example, let’s look for documents authored by “Lastname, First” in 2015 or 2018-2019:
from ads import Document
docs = (
Document.select()
.where(
(Document.author == "Lastname, First")
& (
Document.year.between(2018, 2019)
| (Document.year == 2015)
)
)
)
You can include any searchable field in the expression you give to .where()
. You can also include operators, which are accessible as functions:
from ads import Document
docs = (
Document.select()
.where(
Document.trending(Document.title == "exoplanets")
& Document.author_count.between(1, 5)
)
)
Important
Although you may be tempted to use Python’s in
, and
, or
, and not
operators in your query expressions, these will not work.
The return value of an in
expression is always coerced to a boolean value by Python. Similarly, and
, or
, and not
all treat their arguments as boolean values and cannot be overloaded. So as a guide:
Use | instead of or
Use & instead of and
Use ~ instead of not
Use .in_() and .not_in() instead of in and not in
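The boolean coercion described above can be seen with a small toy class. `Sentinel` and `ToyField` are hypothetical names used only for this demonstration and are unrelated to the `ads` package.

```python
class Sentinel:
    """A truthy object standing in for a query expression."""

    def __repr__(self):
        return "<expression>"


class ToyField:
    def __contains__(self, item):
        # We try to return an expression, but Python converts it to a bool.
        return Sentinel()

    def in_(self, items):
        # An ordinary method can return whatever it likes.
        return Sentinel()


field = ToyField()
print("x" in field)      # True
print(field.in_(["x"]))  # <expression>

a, b = Sentinel(), Sentinel()
# `and` does not combine its operands; it simply returns one of them.
print((a and b) is b)    # True
```

The `in` result has lost the expression entirely, and `and` silently discards its first operand, which is why the method and operator forms listed above are needed.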
You should always wrap each expression in brackets, too. This is because of the way Python treats operator precedence. That means:
Good:
(Document.year == 2000) & (Document.title.like("JWST"))
Bad:
Document.year == 2000 & Document.title.like("JWST")
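The reason the "Bad" form fails is that `&` binds more tightly than `==` in Python. The sketch below uses the standard-library `ast` module to show how each form is parsed. The names `year` and `like` are placeholders inside the parsed strings and are never evaluated.

```python
import ast


def parse(source):
    """Return the top-level expression node for a line of Python source."""
    return ast.parse(source, mode="eval").body


good = parse('(year == 2000) & like("JWST")')
bad = parse('year == 2000 & like("JWST")')

# With brackets, the outermost operation is `&`, joining two sub-expressions.
print(type(good).__name__)  # BinOp

# Without brackets, the outermost operation is `==`, and its right-hand side
# is the whole of `2000 & like("JWST")`.
print(type(bad).__name__)                 # Compare
print(type(bad.comparators[0]).__name__)  # BinOp
```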
Ordering¶
The default ordering for documents returned by ADS is by relevance (computed by the ADS search engine). Instead, you can sort by any of the following fields:
A unique identifier for the document, curated by ADS. |
|
The number of authors. |
|
Document bibliographic code. |
|
The number of citations to this document. |
|
Normalised citation count. |
|
A classic prestige score, designed to more highly rank papers that are relevant and popular now. |
|
Publication date, represented by a time format and used for indexing. |
|
Creation date of the ADS record. |
|
The number of times the record has been viewed within a 90 day window. |
|
The closeness of the document to the query match. |
Use the .order_by()
function to order results:
from ads import Document
# Get the freshest bangerz
docs = (
Document.select()
.order_by(
Document.entry_date.desc()
)
)
# Sort by multiple ordered fields
docs = (
Document.select()
.order_by(
Document.author_count.desc(),
Document.citation_count.asc()
)
)
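To see what ordering by multiple fields means, here is the same idea applied to plain Python dictionaries: sort by author count (descending), then break ties by citation count (ascending). The records and their one-letter labels are invented for this example. In a real search the sorting is done by ADS, not locally.

```python
# Made-up records for illustration only.
records = [
    {"label": "A", "author_count": 3, "citation_count": 10},
    {"label": "B", "author_count": 5, "citation_count": 7},
    {"label": "C", "author_count": 5, "citation_count": 2},
    {"label": "D", "author_count": 1, "citation_count": 99},
]

# Negate the first key to sort it in descending order; the second key
# only matters when two records have the same author count.
ordered = sorted(
    records,
    key=lambda r: (-r["author_count"], r["citation_count"]),
)

print([r["label"] for r in ordered])  # ['C', 'B', 'A', 'D']
```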
Limits¶
It’s good practice to set a limit for the number of documents you want. This reduces the load on the ADS search service, and it helps the ads
package to paginate queries ahead of time.
Most of the search examples we have shown so far don’t have any limits applied. That means they will return every document found by ADS that matches the expression.
We can limit the number of documents retrieved by applying a .limit()
to our .select()
call:
from ads import Document
docs = (
Document.select()
.where(
(Document.year > 2020)
)
.order_by(
Document.citation_count.desc()
)
.limit(10)
)
# Or if we have no filters or sort:
docs = Document.select().limit(3)
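Knowing the limit in advance makes it possible to work out every page request before the first one is sent. The sketch below is one way to do that, assuming 200 documents per page (the figure quoted earlier for a single search call). The function name `plan_pages` is hypothetical, and the `ads` package may paginate differently.

```python
def plan_pages(limit, rows_per_page=200):
    """Return (start, rows) pairs that together cover `limit` documents."""
    pages = []
    start = 0
    while start < limit:
        rows = min(rows_per_page, limit - start)
        pages.append((start, rows))
        start += rows
    return pages


print(plan_pages(10))   # [(0, 10)]
print(plan_pages(450))  # [(0, 200), (200, 200), (400, 50)]
```

Without a limit, a client cannot know how many pages it will need until the first response reports the total number of matches.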
Advanced document searches¶
You can search for documents using expressions that include other data model objects. For example:
Use explicit Solr queries¶
If all of this search syntax with Document.select()
and .where()
and .order_by()
and .limit()
is too frightening or confusing, you can always give an explicit search query to ADS. The way to do this is the same as how you performed searches in previous versions of the ads
package, in order to remain as backwards-compatible as possible.
Here’s how you would give an explicit search query:
from ads import Document, SearchQuery
docs = SearchQuery(
q="author:'Ness, M' AND year:2018",
fl=["bibcode", "title", "author", "citation_count"]
)
# SearchQuery returns a `ModelSelect` object, which you can iterate over
# just like the result of `Document.select()`
for doc in docs:
    print(doc)
# Here's the same query in the "new" way, except we retrieve all fields
docs = (
Document.select()
.where(
(Document.author == "Ness, M")
& (Document.year == 2018)
)
)
If you’re upgrading from ads
0.12.3 then your existing code might still work, but you’ll notice some name changes. For example, when you iterate over an ads.SearchQuery
you will get ads.Document
objects, not ads.Article
objects. If you have trouble replicating your existing queries in the new search format, please create an issue on GitHub.
BigQuery¶
Most ADS searches use the /search/query
API endpoint. However, if the search requires checking for a large number of explicit bibcodes, this can be expensive for the ADS search service. In these situations you should probably use the /search/bigquery
API endpoint. The ads
package automatically evaluates the expression given to ads.Document.select().where()
and decides whether it should use the standard search endpoint, or the BigQuery endpoint. As a user, you don’t have to explicitly set this.
The only reason why you need to know about this is that the BigQuery API endpoint has a lower rate limit than the standard search API endpoint, which means you can’t make as many BigQuery API calls as you can make standard search calls. So if you run into an error where you have hit your API call limit for the day, but you don’t think you have made that many queries, it might be because you have hit the BigQuery API limit, but not the standard search limit.
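The effect of having separate limits per endpoint can be illustrated with a small counter. Everything here is hypothetical: the `RateLimitTracker` class is invented for this example, and the limits of 5 and 2 are made-up numbers, not the real ADS quotas.

```python
class RateLimitTracker:
    """Track remaining calls separately for each endpoint (illustrative only)."""

    def __init__(self, limits):
        self.remaining = dict(limits)

    def record_call(self, endpoint):
        if self.remaining[endpoint] <= 0:
            raise RuntimeError(f"Rate limit reached for {endpoint}")
        self.remaining[endpoint] -= 1


# Made-up limits: the BigQuery endpoint gets the smaller allowance.
tracker = RateLimitTracker({"/search/query": 5, "/search/bigquery": 2})

tracker.record_call("/search/bigquery")
tracker.record_call("/search/bigquery")
tracker.record_call("/search/query")
print(tracker.remaining)
# {'/search/query': 4, '/search/bigquery': 0}

try:
    tracker.record_call("/search/bigquery")
except RuntimeError as error:
    print(error)  # Rate limit reached for /search/bigquery
```

Only three calls were made in total, yet one endpoint is already exhausted while the other has most of its allowance left.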