AYLIEN NEWS API: A Starter Guide for Python Users

Download Jupyter notebook here

Table of Contents

Introduction

In this document, we will review four of the AYLIEN News API's most commonly used endpoints: Stories, Timeseries, Trends and Clusters.

We will utilise AYLIEN's Python SDK (Software Development Kit) and also show you some helpful code to start wrangling the data in Python using Pandas and visualizing it using Plotly.

As an exercise, we will focus on pulling news stories related to Citibank, to show how these different endpoints can be used in combination to investigate a topic of your choice.

Please note, comprehensive documentation on how to use the News API can be found here.

Technical Set-Up

Here we will outline how to connect to AYLIEN's News API and define some useful functions to make pulling and analysing our data easier.

Configuring Your API Connection

First things first — we need to connect to the News API. Make sure that you have installed the aylien_news_api library using pip. The code below demonstrates how to connect to the API and also imports some other libraries that will be useful later.

Don't forget to enter your API credentials in order to connect to the API! If you don't have any credentials yet, you can sign up for a free trial here.
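To make the set-up concrete without a live call, here is a stdlib-only sketch of the request a connection ultimately makes. The header names and base URL follow AYLIEN's public REST documentation, `APP_ID`/`APP_KEY` are placeholders for your own credentials, and in practice the aylien_news_api SDK assembles all of this for you.

```python
from urllib.parse import urlencode

# Placeholder credentials -- substitute your own from the AYLIEN dashboard.
APP_ID = "YOUR_APP_ID"
APP_KEY = "YOUR_APP_KEY"

# Headers the News API expects on every request (per the public docs).
headers = {
    "X-AYLIEN-NewsAPI-Application-ID": APP_ID,
    "X-AYLIEN-NewsAPI-Application-Key": APP_KEY,
}

# A minimal Stories query: one English story from the last hour.
params = {"language[]": "en", "published_at.start": "NOW-1HOUR", "per_page": 1}
url = "https://api.aylien.com/news/stories?" + urlencode(params)
```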

Define Functions to Pull Data

The functions below will be used to pull data from the API using GET requests. In some cases, data will be returned as an array of objects (e.g. the get_stories function); in others, data will be returned as Pandas dataframes (e.g. the get_timeseries function).
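As an offline-testable illustration of the paging logic such a get_stories function typically wraps, the sketch below separates the cursor loop from the API call itself: `fetch_page` stands in for the real SDK/REST call and is stubbed here with canned pages.

```python
def get_stories(params, fetch_page):
    """Collect stories across pages using the News API cursor.

    `fetch_page` is any callable taking a params dict and returning a
    response dict with 'stories' and 'next_page_cursor' keys; in real
    use it would wrap the SDK / REST call.
    """
    stories = []
    params = dict(params, cursor="*")  # '*' starts a fresh scan
    while True:
        response = fetch_page(params)
        stories.extend(response["stories"])
        next_cursor = response.get("next_page_cursor")
        # Stop when the page came back empty or the cursor stopped advancing.
        if not response["stories"] or not next_cursor or next_cursor == params["cursor"]:
            break
        params["cursor"] = next_cursor
    return stories

# Stubbed two-page response to demonstrate the loop.
pages = {
    "*": {"stories": [{"id": 1}, {"id": 2}], "next_page_cursor": "abc"},
    "abc": {"stories": [{"id": 3}], "next_page_cursor": "abc"},
}
result = get_stories({"title": "Citigroup"}, lambda p: pages[p["cursor"]])
```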

Define Other Useful Functions

These other functions will help us format data as necessary.

Making Your First Calls

The Stories Endpoint

The most granular data point we can extract from the News API is a story; all other endpoints are aggregations or extrapolations of stories. Stories are essentially news articles that have been enriched using AYLIEN's machine learning process. We will learn more about this enrichment later.

For now we will pull one story published in English in the last hour.

We can see that the story output is a list with one dictionary object representing the story we queried. The story object includes the title, body text, summary sentences and lots of other contextual information that has been made available via AYLIEN's enrichment process.

We can loop through the object's key names to give us a flavour of what is available.
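For example, given an abbreviated story dictionary (a real story object carries many more fields than this sample), the keys can be listed like so:

```python
# An abbreviated story object; real responses carry many more fields.
story = {
    "id": 100001,
    "title": "Example headline",
    "body": "Example body text...",
    "summary": {"sentences": ["First summary sentence."]},
    "language": "en",
}

# Print each top-level field name available on the story.
for key in story.keys():
    print(key)
```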

Refining Your Query

Using Keyword Search and the Cursor

Using a keyword search, we can search the AYLIEN database for words that appear in the title or body of an article. Here we will search for "Citigroup" in the title.

We will also limit the date range — if we don't, we could return thousands of stories that feature "Citigroup" in the title — and define the language as English ("en"). Defining the language not only limits our output to English-language content, it also allows the query to remove any relevant stopwords. Learn about stopwords here.

We will also introduce the cursor. We don't know how many stories we'll get, and the cursor will allow us to scan through results. Learn more about using the cursor here.

The per_page parameter defines how many stories are returned for each API call, with 100 being the max.

The default parameters below will use relative times to ensure you can access recent news data (historical data is restricted). You can try changing the time periods by altering the parameters using the following formats:

Depending on what parameters you used (and of course, how much Citigroup featured in the news), your number of stories may vary. Let's print the first 10 titles to get a feel for the stories we have pulled.

What if we want to refine our keyword search further? We can create more complicated searches using Boolean statements. For instance, if we were interested in searching for news that mentioned Citigroup or Bank of America and that also mentioned "shares" but not "sell", we could write the following query. It is important to note here that the "Bank of America" search term is wrapped in double quotes — if it wasn't, each word would be treated as an individual search term, but we want to search for the full phrase.
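A sketch of how that Boolean query might be passed as parameters (the parameter names mirror those used in News API stories queries, but treat them as assumptions; no call is made here):

```python
# Boolean keyword search: (Citigroup OR "Bank of America") AND shares NOT sell.
# "Bank of America" is quoted so it matches as a phrase, not three separate words.
title_query = '("Citigroup" OR "Bank of America") AND "shares" NOT "sell"'

params = {
    "title": title_query,
    "language": ["en"],
    "published_at_start": "NOW-7DAYS",  # relative time format, per the docs
    "per_page": 100,
}
```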

Categorical Search - IPTC

We can see that we can refine our query by adding Boolean operators to our keyword search. However, this can become more complicated if we want to cast our net wider. For instance, let's say we want to pull stories about the banking sector in general. Rather than writing a complicated keyword search, we can search by a news category.

AYLIEN's NLP enrichment classifies stories into categories to allow us to make more powerful searches. Our classifier is capable of classifying content into two taxonomies where a code corresponds with a subject. Learn more here.

Here, we will search for all stories classified as "banking" (04006002) using the IPTC subject taxonomy. You can search for other IPTC codes here.

Many stories will be categorised under "banking", so we will restrict our output to the first 100.
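A sketch of the category query's parameters (taxonomy name and parameter names as per the News API docs; treat them as assumptions, and no call is made here):

```python
# Search by IPTC subject code instead of keywords:
# 04006002 is the "banking" subject in the iptc-subjectcode taxonomy.
params = {
    "categories_taxonomy": "iptc-subjectcode",
    "categories_id": ["04006002"],
    "language": ["en"],
    "per_page": 100,  # cap the output at the first 100 stories
}
```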

We can also perform categorical search using the IAB taxonomy or the AYLIEN Smart Tagger, which will be discussed later.

Sorting Your Query Response

You may find you want to sort your query response by some metric. In the examples above, we have taken the top N stories.

These have been sorted, by default, by published date, i.e. we are getting the most recent N stories that meet our search criteria.

Sorting the query response is particularly useful when many stories meet our search criteria but we only want N stories. For example, say 1,000 stories met our search criteria - we could sort these stories by a range of metrics and return the top N.

We can use the following parameters to sort our response by:

You can read more about sorting in our docs.

The sort order is descending by default, but we can explicitly state which direction we want to sort in using the 'sort_direction' parameter.

In the following example, we perform a keyword search and sort by keyword relevance.
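A sketch of the sorted query's parameters (parameter names per the News API's sorting docs; illustrative only, no call is made):

```python
# Keyword search sorted by keyword relevance rather than recency.
params = {
    "title": "Citigroup",
    "language": ["en"],
    "sort_by": "relevance",
    "sort_direction": "desc",  # descending is also the default
    "per_page": 10,
}
```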

AYLIEN Query Language (AQL)

The AYLIEN Query Language (or AQL) is AYLIEN's custom 'flavour' of the Lucene syntax that enables users to make more powerful queries on our data.

Queries in this syntax are made within an 'aql' parameter.

AQL enables us to perform more sophisticated searches like boosting the importance of keywords and enhanced entity search.

Boost

When making a query with many keywords, sometimes one keyword is more important to your search than the others. Boosting enables you to add weight to the more important keyword(s) so that results mentioning them are given a “boost” higher in the results order.

For example, searching ["John", "Frank", "Sarah"] gives equal weight to each term, but ["John", "Frank"^2, "Sarah"] is like saying a mention of “Frank” is twice as important as a mention of “John” or “Sarah”. Stories mentioning “Frank” will therefore appear higher in the rank of search results. We can reduce the importance of a keyword by assigning a weight below one, e.g. ^0.5.

Boosting does not override the keyword search itself; it simply allows the user to specify the preponderant keywords in a list (i.e. if a story contains many mentions of non-boosted search keywords, it could still be returned ahead of stories that mention a boosted keyword). Boosting therefore does not exclude stories from the results; it only affects the order in which results are returned.

The boost is allocated using the ^ symbol.

In the example below, we search for a wide variety of keywords but give special significance to the "radioactive" keyword.
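The AQL string for that boosted search might look as follows (illustrative; keyword choices are ours and no call is made):

```python
# AQL boost: '^2' doubles the weight of "radioactive" relative to the
# other keywords, so stories mentioning it rank higher in the results.
aql = 'title: ("nuclear" OR "uranium" OR "radioactive"^2 OR "reactor")'

params = {"aql": aql, "language": ["en"], "per_page": 10}
```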

Proximity Search

Frequently, keywords of interest to us are mentioned in varying sequences of terms. For example, HSBC's division in China could appear in multiple forms: “HSBC China”, “HSBC’s branches in China”, “In China, HSBC is introducing new…”, etc.

Proximity search is a feature that enables users to broaden the search criteria to return these combinations. “Proximity” refers to the distance, in terms, between two searched terms in a story. For example, "HSBC China"~5 only returns stories that mention "HSBC" and "China" with a maximum of four words in between them.
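A minimal sketch of that proximity query in AQL (illustrative; no call is made):

```python
# Proximity search: "~5" returns stories where "HSBC" and "China" appear
# within five positions of each other (up to four words in between).
aql = 'title: "HSBC China"~5'

params = {"aql": aql, "per_page": 10}
```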

AYLIEN Smart Tagger

AYLIEN leverages two industry-standard taxonomies in our news categorisation, but we also offer our own proprietary taxonomy - the Smart Tagger.

Smart Tagger leverages state-of-the-art classification models built on a vast collection of manually tagged news articles, based on domain-specific industry and topical taxonomies, and provides a highly effective system for identifying categorical and industry-related news content.

As part of the Smart Tagger update we’re introducing two new classification taxonomies: the AYLIEN Industry Taxonomy and the AYLIEN Category Taxonomy, which incorporates two curated category groupings: Adverse Events and Trading Impact Events.

You can explore these taxonomies here.

AYLIEN Categories

A wide and deep collection of topical categories covering popular topics specifically curated for the business and finance world.

Search for categories using a categories label.

Search for a category using a category ID.

Search for one category but explicitly omit another category

Search for a list of categories

Search for a category over a threshold of confidence and sort by this confidence

AYLIEN Industries

A robust collection of multilevel tags that represent the industry a news article is covering.

Users can search for Industry verticals using similar syntax to AYLIEN Categories.

Search for Industries Using IDs

Working with Entities

Similarly, we may be interested in searching for certain recurring subjects appearing in the news: for example, banks, companies, dogs or even aliens! We could do this using keyword search, but AYLIEN provides a solution to this problem by classifying some words as "entities".

What is an entity? The Oxford English Dictionary provides a basic starting point of what an entity is, with its definition being "a thing with distinct and independent existence". Learn more about searching for entities here.

We can use entity types to search for groups of entities without the need for defining an exhaustive list of DBPedia links.

Returning to our query that pulled stories classified as "banking", let's pull all articles categorised as banking that also feature a "Company" or "Bank" entity type in the title:

N.B. AYLIEN's knowledge base switched from using DBPedia (V2 entities) to Wikidata (V3 entities) in February 2021. If you require syntax relating to V2, please contact sales@aylien.com.

Let's look closely at the first story in this output and review the entities in the title.

Note, some entities will be linked to Wikidata URLs. AYLIEN uses Wikidata to train a vast knowledge base in order to identify entities.

Other entities may not be linked to a Wikidata URL. AYLIEN also utilises a Named Entity Recognition model to identify entities in cases where they can't be identified from the knowledge base.

Depending on your query, we should see that the classifier picked up some entities. We can also see some of the entities are linked to Wikidata URLs — we will return to this below.

We are not limited to working with entities in the title however. We can also search for entities in the body of the article. Let's print out the first 10 entities in the body. We can see that AYLIEN's enrichment process identifies a whole range of entity types.

Entity Search Using Wikipedia URL

We have seen how AYLIEN's NLP enrichment identifies entities and that some entities are tagged with Wikidata URLs. Entities can be useful when a keyword or search term can refer to multiple entities. For example, let's imagine we are interested in finding news regarding the company Apple — how do we restrict our search to the company and ignore results about the fruit? We could search for the keyword "Apple" and also search for company entity types as described above, but then we would run the risk of returning titles that mention the fruit, apple, alongside companies other than Apple Inc. We can, however, perform a more specific search using Wikidata and Wikipedia URLs.

Wikidata is a semantic web project that extracts structured information created as part of the Wikipedia project where distinct entities are referred to by URIs (like https://en.wikipedia.org/wiki/Apple_Inc. and https://www.wikidata.org/wiki/Q312). Using these URIs, we can perform very specific searches for topics and reduce the ambiguity in our query. Searching by URI will also identify different surface forms that link to Apple e.g. "Apple", "Apple Inc." and the Apple stock ticker, "AAPL".

Below, we'll demonstrate a search for Citigroup using its Wikipedia URL.

N.B. AYLIEN's knowledge base switched from using DBPedia (V2 entities) to Wikidata (V3 entities) in February 2021. If you require syntax relating to V2, please contact sales@aylien.com.

Search for an Entity by QID

We can search for entities using their Wikidata ID as per below.

Search for an Entity by Surface Form

Sometimes we might want to search for an entity by surface form (i.e. the text mentioned) rather than the wiki ID. This may be because we want to limit to a certain surface form (MSFT and not Microsoft), or because the entity is not in Wikidata and so not in our knowledge base. Our Named Entity Recognition model can still recognise entities that are not in Wikidata, based on the context of the document. This is useful for searching for lesser-known companies, SMEs or start-ups.

In the code below we use surface_forms.text - this is a full-text search. This means that the match does not need to be an exact string: it is case insensitive and ignores special characters.

In contrast, searching via surface_forms on its own will perform an exact string match search i.e. case sensitive with special characters included.

Search for an Entity by Stock Ticker

We can search for entities using their stock ticker (where supported).

Search for an Entity Specifying Entity Type

Sometimes if we are searching for an entity surface form, we may want to specify the entity type to help identify the correct entity. This may be because the entity is not recognised in Wikidata and therefore not in the AYLIEN knowledge base.

However, our Named Entity Recognition model can predict what type an entity is (i.e. Person, Organization, Location, etc.) even if it is not in Wikidata. This enables us to search for entity surface forms and explicitly state what type of entity they should be.

Below we search for the surface form "Apple" and specify that we are looking for an Organization entity type.

Search for an Entity Specifying Title or Body Element

We can specify where in the article we want to find the entity by specifying the title or body elements.

Searching For Multiple Entities at Once

We can add logic to search for multiple entities at once. Note in this example we are using the OR operator to search for one of two entities.
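A sketch of such an OR query in AQL. The double-brace entity sub-query syntax follows the pattern in AYLIEN's entity-search documentation, and the Wikidata QIDs here are assumptions; verify them on wikidata.org before use:

```python
# Match stories whose title carries either the Citigroup or the
# Bank of America entity (QIDs are illustrative assumptions).
aql = (
    'entities: {{element: title AND id: Q219508}} '    # Citigroup (assumed QID)
    'OR entities: {{element: title AND id: Q487907}}'  # Bank of America (assumed QID)
)

params = {"aql": aql, "language": ["en"], "per_page": 10}
```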

Searching by Entity and Entity Level Sentiment Analysis

We can also limit the stories we want by entity sentiment, as exemplified below. Here we will search for negative mentions of Citigroup.

Here we will isolate the Citigroup entity in the first story to show it is classified with negative sentiment.

Entity Prominence

Entity prominence is a measure of how significant a mention of an entity is on a scale of 0-1.

Intuitively - as consumers of news - we know that if an entity appears in the title, in the first paragraph or many times in an article, then it is pretty significant. AYLIEN's entity prominence metric captures this significance.

We can use this as a query parameter to filter out insignificant mentions of an entity by setting an entity prominence threshold. We can also sort by entity prominence to see the most significant mentions first. For more ways to sort your query output see here.
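To make the metric concrete, here is a small offline sketch that filters and sorts a sample list of entity mentions by prominence, mirroring what a prominence threshold does server-side (field names on the sample dicts are illustrative):

```python
# Sample entity mentions with prominence scores (illustrative values).
entities = [
    {"surface_form": "Citigroup", "prominence": 0.92},
    {"surface_form": "SEC", "prominence": 0.31},
    {"surface_form": "Citigroup", "prominence": 0.08},
    {"surface_form": "Wells Fargo", "prominence": 0.55},
]

THRESHOLD = 0.5  # drop insignificant mentions below this score

significant = sorted(
    (e for e in entities if e["prominence"] >= THRESHOLD),
    key=lambda e: e["prominence"],
    reverse=True,  # most significant mentions first
)
```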

Non-English Content

So far we have pulled stories in English only. However, our News API supports 6 native languages and 10 translated languages:

Native Languages:

Translated Languages:

Let's perform a search in some native languages other than English. Here we'll search for stories featuring Citigroup in the title and print the native language title and an English title.

Create a Pandas Dataframe From a List of Story Dictionaries

Up to now we have interrogated our News API output by converting the JSON objects to Python dictionaries, iterating through them and printing the elements. Sometimes we may wish to view the data in a more tabular format. Below, we will loop through our non-English content stories and create a Pandas dataframe. This will also be useful later when we want to visualize our data.

We'll also pull out some contextual information about each story, such as the article's permalink and the story's sentiment score. AYLIEN's enrichment process predicts the overall sentiment in the body and title of a document as positive, negative or neutral and also outputs a confidence score.
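A minimal sketch of that flattening step on two abbreviated sample stories (the nested field names follow the story structure discussed earlier, but real responses carry many more fields):

```python
import pandas as pd

# Abbreviated sample stories (illustrative values).
stories = [
    {"title": "Citigroup beats estimates",
     "links": {"permalink": "https://example.com/a"},
     "sentiment": {"body": {"polarity": "positive", "score": 0.88}}},
    {"title": "Citigroup fined by regulator",
     "links": {"permalink": "https://example.com/b"},
     "sentiment": {"body": {"polarity": "negative", "score": 0.91}}},
]

# Flatten the nested fields we care about into one row per story.
df = pd.DataFrame(
    [
        {
            "title": s["title"],
            "permalink": s["links"]["permalink"],
            "body_polarity": s["sentiment"]["body"]["polarity"],
            "body_score": s["sentiment"]["body"]["score"],
        }
        for s in stories
    ]
)
```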

The Timeseries Endpoint

Pull Timeseries

We have seen how we can pull granular stories using the Stories endpoint. However, if we want to investigate volumes of stories over time, we can use the Timeseries endpoint. This endpoint retrieves the stories that meet our criteria and aggregates them per minute, hour, day, month, or however we see fit. This can be very useful for identifying spikes or dips in news volume relating to a subject of interest. By default, our query below will aggregate the volume of stories per day.

The Timeseries endpoint outputs data in a JSON format, but our function above will convert this to a Pandas dataframe for legibility.

Visualizing Timeseries

We can make sense of timeseries data much more quickly if we visualize it. Below, we make use of the Plotly library to visualize the data.

Exploring Spikes in Timeseries Data

We can see from the graph that there are various spikes in news volume. We can explore the cause of these spikes by pulling a story that will give us an indication of why Citigroup received so much attention using Alexa Ranking. Alexa Ranking is an estimate of a site's popularity on the internet. Learn more about working with Alexa Ranking here.

Below, we'll identify the three dates with the most stories, then pull the highest ranked story for those dates using the same parameters we used to query the Timeseries endpoint.
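The date-picking step can be sketched offline with sample counts in the shape the Timeseries endpoint returns (values are illustrative):

```python
# Daily story counts as returned by the Timeseries endpoint (illustrative).
time_series = [
    {"published_at": "2023-05-01", "count": 40},
    {"published_at": "2023-05-02", "count": 310},
    {"published_at": "2023-05-03", "count": 55},
    {"published_at": "2023-05-04", "count": 270},
    {"published_at": "2023-05-05", "count": 120},
]

# The three dates with the highest story volume.
spike_dates = [
    point["published_at"]
    for point in sorted(time_series, key=lambda p: p["count"], reverse=True)[:3]
]
```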

Add Labels to Timeseries Spikes

We will now append these titles to the spikes in the graph we previously created. If we hover over the markers, the tooltip will display the relevant story title.

Pull Document Timeseries by Sentiment

We filter our timeseries queries in the same ways as stories, but one filter that is particularly interesting is filtering on sentiment. We have already discussed how stories are given a sentiment score at a granular level and we can use this score to pull volume of stories by title sentiment polarity over time.

In the cell below, we run a function that queries the Timeseries endpoint twice — once for positive-sentiment stories and once for negative stories.

Visualizing Timeseries by Sentiment

Visualizing Entity Timeseries by Sentiment

We can also track entity level sentiment over time.

The Trends Endpoint

Similar to the Timeseries endpoint, we may be interested in seeing themes and patterns over time that aren't immediately apparent when looking at individual stories. The Trends endpoint allows us to see the most frequently recurring entities, concepts or keywords that appear in articles that meet our search criteria.

Below we will pull the most frequently occurring entities in the body of stories mentioning Citigroup over a month.

Note: this query will take longer to run than the previous endpoints, as the News API is performing analysis on all entities included in all the stories that meet our search criteria.

We can visualize the output of the Trends endpoint as a wordcloud to help us quickly interpret the most prevalent keywords.

We have used a wordcloud to investigate the most prominent entities in a one-month period, but what if we want to investigate the frequency of mentions over time?

We can loop over the Trends endpoint and create a timeseries to investigate the distribution of entities over time.

First we will create a function to generate a list of tuples containing daily intervals, to allow us to search for trends day by day within a defined period.
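A minimal version of such an interval function, using datetime.date objects (the function name is ours, not part of any SDK):

```python
from datetime import date, timedelta

def daily_intervals(start, end):
    """Return a list of (day_start, day_end) date tuples covering [start, end)."""
    days = (end - start).days
    return [
        (start + timedelta(days=i), start + timedelta(days=i + 1))
        for i in range(days)
    ]

# Three daily intervals spanning 1-4 May 2023.
intervals = daily_intervals(date(2023, 5, 1), date(2023, 5, 4))
```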

Next, we will define our date range, create a list of date tuples and iterate over those daily intervals to populate a dataframe that relates each entity, the number of times it was mentioned and the day the mentions occurred.

We can loop over this dataframe and visualize the distribution of the different entities. Note, the code below visualizes only the top 10 entities.

The Clusters Endpoint

Naturally, multiple news stories will exist that report on the same or similar topics. AYLIEN's clustering enrichment groups stories together that typically correspond to real-world events or topics. Clusters are made of stories that exist close to one another in vector space and the clustering enrichment links clusters to a "representative story" that exists in the centre of each cluster — reading this representative story provides an indication of the general nature of the entire cluster.

Similar to the Timeseries and Trends endpoints, clusters enable us to review stories over time and identify points of interest. We can search for individual clusters using a cluster ID but, as with stories, we will generally not know the IDs of interest before we find them. Consequently, we can search for clusters using the Trends endpoint, which allows us to filter clusters based on the stories they contain.

The Trends endpoint returns the IDs of clusters sorted by the count of stories associated with them. Once we have each cluster’s ID, we can go on to get the stories for each cluster from the Stories endpoint. Note that the Trends endpoint only returns the top 100 clusters for a given query.

The following script identifies clusters of news that feature the Citigroup entity using the Trends endpoint and returns the top 3 stories in each cluster, ranked by Alexa ranking.

If we look at the first 3 clusters returned, we can see the number of stories associated with each cluster, the representative story title and the top 3 ranked stories.

Visualizing Cluster Data

We can easily visualize the cluster data to make it more digestible and understandable. Below we'll convert it to a Pandas dataframe and then visualize it with Plotly.

Visualize the Clusters in a scatterplot

Conclusion

Here we have given a quick introduction to four of the AYLIEN News API's most frequently used endpoints. With these code and visualization examples, you should be able to start exploring news data in no time!