AYLIEN NEWS API: A Starter Guide for Python Users

Download Jupyter notebook here

Introduction

In this document, we will review four of the AYLIEN News API's most commonly used endpoints:

  • stories (pull articles that have been enriched by AYLIEN's NLP technology)
  • timeseries (pull the volume of stories that meet your query over time)
  • trends (identify the most prevalent entities, concepts or keywords that appear in the stories that meet your criteria)
  • clusters (identify clusters of similar news stories to investigate events)

We will utilise AYLIEN's Python SDK (Software Development Kit) and also show you some helpful code to start wrangling the data in Python using Pandas and visualizing it using Plotly.

As an exercise, we will focus on pulling news stories related to Citigroup, to show how these different endpoints can be used in combination to investigate a topic of your choice.

Please note, comprehensive documentation on how to use the News API can be found here.

Configuring Your API Connection

First things first — we need to connect to the News API. Make sure that you have installed the aylien_news_api library using pip. The code below demonstrates how to connect to the API and also imports some other libraries that will be useful later.

Don't forget to enter your API credentials in order to connect to the API! If you don't have any credentials yet, you can sign up for a free trial here.

In [1]:
from __future__ import print_function


# install packages if not installed already
# (datetime, json, time, math and pprint ship with Python's standard library and need no install)
!pip install pandas
!pip install numpy
!pip install plotly
!pip install aylien_news_api
!pip install chart_studio
!pip install tqdm

from datetime import datetime
from datetime import timedelta
from dateutil.tz import tzutc
import json
import time
import pandas as pd
import numpy as np
import math
from tqdm import tqdm
from pprint import pprint

# for visualization
import plotly.graph_objs as go
import chart_studio.plotly as py
from plotly.subplots import make_subplots

# import the AYLIEN news API library
import aylien_news_api
from aylien_news_api.rest import ApiException

configuration = aylien_news_api.Configuration()

# Configure API key authorization: app_id
configuration.api_key['X-AYLIEN-NewsAPI-Application-ID'] = 'YOUR_API_ID'

# Configure API key authorization: app_key
configuration.api_key['X-AYLIEN-NewsAPI-Application-Key'] = 'YOUR_API_KEY'

# Create an instance of the API class
api_instance = aylien_news_api.DefaultApi(aylien_news_api.ApiClient(configuration))

print('Complete')
Complete

The Stories Endpoint

The most granular data point we can extract from the News API is a story; all other endpoints are aggregations or extrapolations of stories. Stories are news articles that have been enriched using AYLIEN's machine learning process. We will learn more about this enrichment later.

For now we will pull one individual story using a pre-identified unique story ID.

In [2]:
# define the query parameters
opts = { 
  'id': [71142615]
}
try:
    # List Stories
    api_response = api_instance.list_stories(**opts)
except ApiException as e:
    print("Exception when calling DefaultApi->list_stories: %s\n" % e)
    

# convert the api response to a python dictionary
api_response = api_response.to_dict()

stories = api_response['stories']

pprint(stories)
[{'author': {'avatar_url': None, 'id': 941213, 'name': 'Cynthia Vaughn'},
  'body': 'AMC Entertainment (NYSE:AMC) was downgraded by investment analysts '
          'at Citigroup from a “buy” rating to a “sell” rating in a research '
          'report issued on Wednesday, MarketBeat reports.\n'
          '\n'
          'A number of other equities analysts have also recently issued '
          'reports on the company. Zacks Investment Research upgraded AMC '
          'Entertainment from a “hold” rating to a “buy” rating and set a '
          '$5.50 target price on the stock in a report on Friday, March 6th. '
          'B. Riley cut AMC Entertainment from a “buy” rating to a “neutral” '
          'rating and set a $3.50 price target for the company. in a research '
          'note on Wednesday. Benchmark downgraded shares of AMC Entertainment '
          'from a “buy” rating to a “hold” rating in a research report on '
          'Monday, March 16th. Barrington Research reduced their price '
          'objective on shares of AMC Entertainment from $12.00 to $7.00 and '
          'set an “outperform” rating on the stock in a research note on '
          'Tuesday, March 10th. Finally, Imperial Capital decreased their '
          'price objective on shares of AMC Entertainment from $21.00 to '
          '$20.00 and set an “outperform” rating for the company in a report '
          'on Tuesday, January 7th. Two investment analysts have rated the '
          'stock with a sell rating, seven have issued a hold rating and seven '
          'have assigned a buy rating to the stock. AMC Entertainment '
          'currently has an average rating of “Hold” and an average price '
          'target of $10.38.\n'
          '\n'
          'Shares of AMC Entertainment stock traded down $0.18 on Wednesday, '
          'hitting $3.19. 5,741,596 shares of the company’s stock traded '
          'hands, compared to its average volume of 5,638,135. The firm’s '
          '50-day moving average price is $5.71 and its two-hundred day moving '
          'average price is $8.09. The company has a debt-to-equity ratio of '
          '8.02, a current ratio of 0.35 and a quick ratio of 0.35. The stock '
          'has a market capitalization of $351.29 million, a PE ratio of -1.83 '
          'and a beta of 0.62. AMC Entertainment has a one year low of $1.95 '
          'and a one year high of $17.07.\n'
          '\n'
          'Institutional investors have recently added to or reduced their '
          'stakes in the company. Verus Capital Partners LLC bought a new '
          'position in AMC Entertainment during the 4th quarter worth about '
          '$99,000. Sunbelt Securities Inc. acquired a new stake in AMC '
          'Entertainment in the 4th quarter valued at approximately '
          '$7,584,000. PVG Asset Management Corp acquired a new stake in AMC '
          'Entertainment in the 4th quarter valued at approximately $910,000. '
          'Geode Capital Management LLC lifted its stake in AMC Entertainment '
          'by 2.6% in the 4th quarter. Geode Capital Management LLC now owns '
          '680,858 shares of the company’s stock valued at $4,929,000 after '
          'acquiring an additional 17,400 shares in the last quarter. Finally, '
          'Aristeia Capital LLC lifted its stake in AMC Entertainment by 12.2% '
          'in the 4th quarter. Aristeia Capital LLC now owns 230,270 shares of '
          'the company’s stock valued at $1,667,000 after acquiring an '
          'additional 25,000 shares in the last quarter. Institutional '
          'investors own 46.25% of the company’s stock.\n'
          '\n'
          'AMC Entertainment Holdings, Inc, through its subsidiaries, involved '
          'in the theatrical exhibition business. The company owns, operates, '
          'or has interests in theatres. As of December 31, 2018, it owned, '
          'operated, or had interests in 637 theatres with a total of 8,114 '
          'screens in the United States; and 369 theatres and 2,977 screens in '
          'European markets.\n'
          '\n'
          'Receive News & Ratings for AMC Entertainment Daily - Enter your '
          'email address below to receive a concise daily summary of the '
          "latest news and analysts' ratings for AMC Entertainment and related "
          "companies with MarketBeat.com's FREE daily email newsletter.",
  'categories': [{'confident': True,
                  'id': 'IAB13',
                  'level': 1,
                  'links': {'_self': 'https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB13',
                            'parent': None},
                  'score': 0.44,
                  'taxonomy': 'iab-qag'},
                 {'confident': True,
                  'id': 'IAB13-11',
                  'level': 2,
                  'links': {'_self': 'https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB13-11',
                            'parent': 'https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB13'},
                  'score': 0.2,
                  'taxonomy': 'iab-qag'},
                 {'confident': True,
                  'id': '04016009',
                  'level': 3,
                  'links': {'_self': 'https://api.aylien.com/api/v1/classify/taxonomy/iptc-subjectcode/04016009',
                            'parent': 'https://api.aylien.com/api/v1/classify/taxonomy/iptc-subjectcode/04016000'},
                  'score': 0.12,
                  'taxonomy': 'iptc-subjectcode'}],
  'characters_count': 3527,
  'clusters': [107981937],
  'entities': {'body': [{'indices': [[3425, 3431]],
                         'links': {'dbpedia': 'http://dbpedia.org/resource/Nielsen_ratings'},
                         'score': 0.9303264021873474,
                         'text': 'ratings',
                         'types': ['Systems', 'Place']},
                        {'indices': [[1750, 1770]],
                         'links': {'dbpedia': 'http://dbpedia.org/resource/Market_capitalization'},
                         'score': 1.0,
                         'text': 'market capitalization',
                         'types': ['Value', 'Company']},
                        {'indices': [[720, 729]],
                         'links': {'dbpedia': 'http://dbpedia.org/resource/Barrington,_Illinois'},
                         'score': 0.4156203866004944,
                         'text': 'Barrington',
                         'types': ['Location',
                                   'PopulatedPlace',
                                   'Place',
                                   'Settlement',
                                   'Village']},
                        {'indices': [[0, 16],
                                     [303, 319],
                                     [449, 465],
                                     [619, 635],
                                     [783, 799],
                                     [980, 996],
                                     [1252, 1268],
                                     [1362, 1378],
                                     [1832, 1848],
                                     [2049, 2065],
                                     [2159, 2175],
                                     [2282, 2298],
                                     [2402, 2418],
                                     [2660, 2676],
                                     [2921, 2937],
                                     [3300, 3316],
                                     [3437, 3453]],
                         'links': {'dbpedia': 'http://dbpedia.org/resource/AMC_Theatres'},
                         'score': 1.0,
                         'text': 'AMC Entertainment',
                         'types': ['Chain',
                                   'Organisation',
                                   'Agent',
                                   'Company']},
                        {'indices': [[1794, 1801]],
                         'links': {'dbpedia': 'http://dbpedia.org/resource/Price–earnings_ratio'},
                         'score': 1.0,
                         'text': 'PE ratio',
                         'types': ['Ratio']},
                        {'indices': [[3201, 3213]],
                         'links': {'dbpedia': 'http://dbpedia.org/resource/United_States'},
                         'score': 0.9998650550842285,
                         'text': 'United States',
                         'types': ['Location',
                                   'Country',
                                   'Person',
                                   'Republic',
                                   'PopulatedPlace',
                                   'Place']},
                        {'indices': [[1532, 1539]],
                         'links': {'dbpedia': 'http://dbpedia.org/resource/The_Firm_(1993_film)'},
                         'score': 0.2951214909553528,
                         'text': 'The firm',
                         'types': ['Film',
                                   'Product',
                                   'Wikidata:Q11424',
                                   'Work']},
                        {'indices': [[1550, 1563], [1604, 1617]],
                         'links': {'dbpedia': 'http://dbpedia.org/resource/Moving_average'},
                         'score': 0.9911896586418152,
                         'text': 'moving average',
                         'types': ['Calculation']},
                        {'indices': [[2353, 2357], [2448, 2452]],
                         'links': {'dbpedia': 'http://dbpedia.org/resource/Geode'},
                         'score': 0.5807703137397766,
                         'text': 'Geode',
                         'types': ['Structures', 'Building']},
                        {'indices': [[19, 22]],
                         'links': {'dbpedia': 'http://dbpedia.org/resource/New_York_Stock_Exchange'},
                         'score': 0.9989675283432007,
                         'text': 'NYSE',
                         'types': ['Exchange',
                                   'ArchitecturalStructure',
                                   'Building',
                                   'Location',
                                   'Place',
                                   'Company',
                                   'Organisation']},
                        {'indices': [[2150, 2154],
                                     [2273, 2277],
                                     [2393, 2397],
                                     [2651, 2655]],
                         'links': {'dbpedia': 'http://dbpedia.org/resource/Stake_(Latter_Day_Saints)'},
                         'score': 0.6769386529922485,
                         'text': 'stake',
                         'types': ['Unit', 'Organisation']},
                        {'indices': [[70, 78]],
                         'links': {'dbpedia': 'http://dbpedia.org/resource/Citigroup'},
                         'score': 1.0,
                         'text': 'Citigroup',
                         'types': ['Organisation', 'Bank', 'Agent', 'Company']},
                        {'indices': [[0, 2],
                                     [24, 26],
                                     [303, 305],
                                     [449, 451],
                                     [619, 621],
                                     [783, 785],
                                     [980, 982],
                                     [1252, 1254],
                                     [1362, 1364],
                                     [1832, 1834],
                                     [2049, 2051],
                                     [2159, 2161],
                                     [2282, 2284],
                                     [2402, 2404],
                                     [2660, 2662],
                                     [2921, 2923],
                                     [3300, 3302],
                                     [3437, 3439]],
                         'links': {'dbpedia': 'http://dbpedia.org/resource/AMC_(TV_channel)'},
                         'score': 0.6953635811805725,
                         'text': 'AMC',
                         'types': ['Cable',
                                   'Broadcaster',
                                   'Organisation',
                                   'Agent',
                                   'TelevisionStation']},
                        {'indices': [[2130, 2132], [2949, 2951]],
                         'links': None,
                         'score': None,
                         'text': 'Inc',
                         'types': ['Place']},
                        {'indices': [[436, 443]],
                         'links': None,
                         'score': None,
                         'text': 'B. Riley',
                         'types': ['Person']},
                        {'indices': [[1352, 1357]],
                         'links': None,
                         'score': None,
                         'text': 'Shares',
                         'types': ['Person']},
                        {'indices': [[268, 292]],
                         'links': None,
                         'score': None,
                         'text': 'Zacks Investment Research',
                         'types': ['Organisation']},
                        {'indices': [[588, 596]],
                         'links': None,
                         'score': None,
                         'text': 'Benchmark',
                         'types': ['Organisation']},
                        {'indices': [[720, 738]],
                         'links': None,
                         'score': None,
                         'text': 'Barrington Research',
                         'types': ['Organisation']},
                        {'indices': [[918, 933]],
                         'links': None,
                         'score': None,
                         'text': 'Imperial Capital',
                         'types': ['Organisation']},
                        {'indices': [[1794, 1795]],
                         'links': None,
                         'score': None,
                         'text': 'PE',
                         'types': ['Organisation']},
                        {'indices': [[1910, 1922], [2861, 2873]],
                         'links': None,
                         'score': None,
                         'text': 'Institutional',
                         'types': ['Organisation']},
                        {'indices': [[1997, 2022]],
                         'links': None,
                         'score': None,
                         'text': 'Verus Capital Partners LLC',
                         'types': ['Organisation']},
                        {'indices': [[2111, 2132]],
                         'links': None,
                         'score': None,
                         'text': 'Sunbelt Securities Inc',
                         'types': ['Organisation']},
                        {'indices': [[2236, 2256]],
                         'links': None,
                         'score': None,
                         'text': 'Asset Management Corp',
                         'types': ['Organisation']},
                        {'indices': [[2353, 2380], [2448, 2475]],
                         'links': None,
                         'score': None,
                         'text': 'Geode Capital Management LLC',
                         'types': ['Organisation']},
                        {'indices': [[2619, 2638], [2707, 2726]],
                         'links': None,
                         'score': None,
                         'text': 'Aristeia Capital LLC',
                         'types': ['Organisation']},
                        {'indices': [[2921, 2946]],
                         'links': None,
                         'score': None,
                         'text': 'AMC Entertainment Holdings',
                         'types': ['Organisation']},
                        {'indices': [[3273, 3294]],
                         'links': None,
                         'score': None,
                         'text': 'Receive News & Ratings',
                         'types': ['Organisation']},
                        {'indices': [[3300, 3330]],
                         'links': None,
                         'score': None,
                         'text': 'AMC Entertainment Daily - Enter',
                         'types': ['Organisation']},
                        {'indices': [[3482, 3495]],
                         'links': None,
                         'score': None,
                         'text': 'MarketBeat.com',
                         'types': ['Organisation']}],
               'title': [{'indices': [[40, 43]],
                          'links': {'dbpedia': 'http://dbpedia.org/resource/New_York_Stock_Exchange'},
                          'score': 1.0,
                          'text': 'NYSE',
                          'types': ['Exchange',
                                    'ArchitecturalStructure',
                                    'Building',
                                    'Location',
                                    'Place',
                                    'Company',
                                    'Organisation']},
                         {'indices': [[21, 23], [45, 47]],
                          'links': {'dbpedia': 'http://dbpedia.org/resource/Volkswagen'},
                          'score': 1.0,
                          'text': 'AMC',
                          'types': ['Manufacturer',
                                    'Organisation',
                                    'Agent',
                                    'Company']},
                         {'indices': [[21, 37]],
                          'links': {'dbpedia': 'http://dbpedia.org/resource/AMC_Theatres'},
                          'score': 1.0,
                          'text': 'AMC Entertainment',
                          'types': ['Chain',
                                    'Organisation',
                                    'Agent',
                                    'Company']},
                         {'indices': [[0, 8]],
                          'links': {'dbpedia': 'http://dbpedia.org/resource/Citigroup'},
                          'score': 1.0,
                          'text': 'Citigroup',
                          'types': ['Organisation',
                                    'Bank',
                                    'Agent',
                                    'Company']},
                         {'indices': [[0, 37]],
                          'links': None,
                          'score': None,
                          'text': 'Citigroup Downgrades AMC Entertainment',
                          'types': ['Organisation']}]},
  'hashtags': ['#AMCTheatres',
               '#Stake',
               '#Citigroup',
               '#NYSE',
               '#NewYorkStockExchange',
               '#AMC',
               '#AMC',
               '#MovingAverage',
               '#Geode',
               '#Stock',
               '#BarringtonIllinois',
               '#WeightedArithmeticMean',
               '#TheFirm',
               '#MarketCapitalization',
               '#Price–earningsRatio',
               '#UnitedStates',
               '#NielsenRatings'],
  'id': 71142615,
  'keywords': ['Citigroup',
               'Entertainment',
               'NYSE',
               'AMC',
               'AMC Entertainment',
               'company',
               'rating',
               'stock',
               'price',
               'report',
               'ratings',
               'market capitalization',
               'Barrington',
               'average rating',
               'PE ratio',
               'United States',
               'The firm',
               'equities',
               'moving average',
               'Geode',
               'stake'],
  'language': 'en',
  'links': {'canonical': None,
            'coverages': '/coverages?story_id=71142615',
            'permalink': 'https://www.com-unik.info/2020/03/22/citigroup-downgrades-amc-entertainment-nyseamc-to-sell.html',
            'related_stories': '/related_stories?story_id=71142615'},
  'media': [{'content_length': 14333,
             'format': 'JPEG',
             'height': 220,
             'type': 'image',
             'url': 'https://www.marketbeat.com/logos/amc-entertainment-holdings-inc-logo.jpg',
             'width': 500}],
  'paragraphs_count': 6,
  'published_at': datetime.datetime(2020, 3, 22, 19, 11, 21, tzinfo=tzutc()),
  'sentences_count': 29,
  'sentiment': {'body': {'polarity': 'neutral', 'score': 0.550595},
                'title': {'polarity': 'neutral', 'score': 0.620073}},
  'social_shares_count': {'facebook': [],
                          'google_plus': [],
                          'linkedin': [],
                          'reddit': [{'count': 0,
                                      'fetched_at': datetime.datetime(2020, 3, 23, 18, 12, 50, tzinfo=tzutc())},
                                     {'count': 0,
                                      'fetched_at': datetime.datetime(2020, 3, 23, 9, 15, 15, tzinfo=tzutc())},
                                     {'count': 0,
                                      'fetched_at': datetime.datetime(2020, 3, 23, 0, 16, 36, tzinfo=tzutc())}]},
  'source': {'description': None,
             'domain': 'com-unik.info',
             'home_page_url': 'https://www.com-unik.info/',
             'id': 14591,
             'links_in_count': None,
             'locations': [],
             'logo_url': 'https://www.com-unik.info/wp-content/uploads/2017/09/favicon-com-unik.png',
             'name': 'Community Financial News',
             'rankings': {'alexa': [{'country': None,
                                     'fetched_at': datetime.datetime(2019, 6, 6, 16, 29, 13, tzinfo=tzutc()),
                                     'rank': 4774803}]},
             'scopes': [],
             'title': None},
  'summary': {'sentences': ['Zacks Investment Research upgraded AMC '
                            'Entertainment from a “hold” rating to a “buy” '
                            'rating and set a $5.50 target price on the stock '
                            'in a report on Friday, March 6th.',
                            'B. Riley cut AMC Entertainment from a “buy” '
                            'rating to a “neutral” rating and set a $3.50 '
                            'price target for the company.',
                            'Benchmark downgraded shares of AMC Entertainment '
                            'from a “buy” rating to a “hold” rating in a '
                            'research report on Monday, March 16th.',
                            'Barrington Research reduced their price objective '
                            'on shares of AMC Entertainment from $12.00 to '
                            '$7.00 and set an “outperform” rating on the stock '
                            'in a research note on Tuesday, March 10th.',
                            'Finally, Imperial Capital decreased their price '
                            'objective on shares of AMC Entertainment from '
                            '$21.00 to $20.00 and set an “outperform” rating '
                            'for the company in a report on Tuesday, January '
                            '7th.']},
  'title': 'Citigroup Downgrades AMC Entertainment (NYSE:AMC) to Sell',
  'translations': {'en': None},
  'words_count': 625}]

We can see that the story output is a list with one dictionary object representing the story we queried. The story object includes the title, body text, summary sentences and lots of other contextual information that has been made available via AYLIEN's enrichment process.

We can loop through the object's key names to give us a flavour of what is available.

In [3]:
for key in stories[0]:
    print(key)
author
body
categories
characters_count
clusters
entities
hashtags
id
keywords
language
links
media
paragraphs_count
published_at
sentences_count
sentiment
social_shares_count
source
summary
title
translations
words_count

Using Keyword Search and the Cursor

In a real-world scenario, we will rarely pull stories using individual story IDs — how would we pull the news we want without knowing the IDs?

More often than not, we pull stories using keyword searches. Using a keyword search, we can search the AYLIEN database for words that appear in the title or body of an article. Here we will search for "Citigroup" in the title.

We will also limit the date range (if we don't, the query could return thousands of stories that feature "Citigroup" in the title) and define the language as English ("en"). Defining the language not only limits our output to English-language content, it also allows the query to remove any relevant stopwords. Learn about stopwords here.

We will also introduce the cursor. We don't know how many stories we'll get, and the cursor will allow us to scan through results. Learn more about using the cursor here. One final thing to note is that we will also convert the AYLIEN story objects into Python dictionaries.

Note: we will call the fetch_new_stories() function defined below throughout the rest of this exercise.

In [4]:
# convert all AYLIEN story objects to python dictionaries
def convert_to_dict(stories):
    for index, value in enumerate(stories):
        stories[index] = value.to_dict()
            

# define function to retrieve the stories
def fetch_new_stories(params=None):
    # use None instead of a mutable default argument, since params is updated below
    params = params if params is not None else {}
    fetched_stories = []
    stories = None
    
    while stories is None or len(stories) > 0:
        try:
            response = api_instance.list_stories(**params)
        except ApiException as e:
            if e.status == 429:
                print('Usage limits are exceeded. Waiting for 60 seconds...')
                time.sleep(60)
                continue
            # any other API error would leave `response` stale or undefined, so report and stop
            print("Exception when calling DefaultApi->list_stories: %s\n" % e)
            break

        stories = response.stories
        convert_to_dict(stories) 
        
        params['cursor'] = response.next_page_cursor

        fetched_stories += stories
        print("Fetched %d stories. Total story count so far: %d" %(len(stories), len(fetched_stories)))
    
    return fetched_stories


# define the query parameters
params = {
  'language': ['en'],
  'title': 'Citigroup',
  'published_at_start':'2020-03-22T00:00:00Z',
  'published_at_end':'2020-03-23T00:00:00Z',
  'cursor': '*',
  'per_page' : 50
}

stories = fetch_new_stories(params)

print('************')
print("Fetched %s stories" %(len(stories)))
Fetched 50 stories. Total story count so far: 50
Fetched 24 stories. Total story count so far: 74
Fetched 0 stories. Total story count so far: 74
************
Fetched 74 stories

We can see that the query returned 74 story objects. Let's print the first 10 titles to get a feel for the stories we have pulled.

In [5]:
for story in stories[0:10]:
    print(story['id'])
    print(story['title'])
    print('')
71142615
Citigroup Downgrades AMC Entertainment (NYSE:AMC) to Sell

71139280
Trade Finance Market 2020 | In-Depth Study On The Current State Of The Industry And Key Insights Of The Business Scenario By 2027 | Citigroup Inc, Credit Agricole, BNP Paribas, JPMorgan Chase & Co

71129651
Orion Portfolio Solutions LLC Invests $743,000 in Citigroup Inc (NYSE:C)

71127613
AMC Entertainment (NYSE:AMC) Lowered to “Sell” at Citigroup

71122386
Ulta Beauty (NASDAQ:ULTA) Upgraded by Citigroup to “Buy”

71096907
Citigroup Trims CymaBay Therapeutics (NASDAQ:CBAY) Target Price to $1.60

71096305
Citigroup Lowers Illumina (NASDAQ:ILMN) Price Target to $250.00

71094705
CASIO COMPUTER/ADR (OTCMKTS:CSIOY) Lifted to Buy at Citigroup

71094437
Meggitt (OTCMKTS:MEGGF) Stock Rating Upgraded by Citigroup

71093159
PRADA S P A/ADR (OTCMKTS:PRDSY) Lifted to Buy at Citigroup

Boolean Search

Excellent! So we have established how to pull stories and convert them into a Python-friendly format.

What if we want to refine our keyword search further? We can create more complicated searches using Boolean statements. For instance, if we were interested in searching for news that mentioned Citigroup or Bank of America and that also mentioned "shares" but not "sell", we could write the following query. It is important to note that the "Bank of America" search term is wrapped in double quotes; if it wasn't, each individual word would be treated as an individual search term.

In [6]:
# define the query parameters
params = {
  'language': ['en'],
  'title': '("Citigroup" OR "Bank of America" ) AND "shares" NOT "sell"',
  'published_at_start':'2020-03-22T00:00:00Z',
  'published_at_end':'2020-03-23T00:00:00Z',
  'cursor': '*',
  'per_page' : 50
}

stories = fetch_new_stories(params)

print('************')
print("Fetched %s stories" %(len(stories)))
print('************')

for story in stories:
    print(story['title'])
    print('')
Fetched 14 stories. Total story count so far: 14
Fetched 0 stories. Total story count so far: 14
************
Fetched 14 stories
************
Bank of America Corp DE Acquires 46,700 Shares of TransUnion (NYSE:TRU)

SPDR Wells Fargo Preferred Stock ETF (NYSEARCA:PSK) Shares Sold by Citigroup Inc.

Citigroup Inc. Buys 24,522 Shares of Heritage Commerce Corp. (NASDAQ:HTBK)

Belden Inc. (NYSE:BDC) Shares Bought by Citigroup Inc.

Citigroup Inc. Acquires 1,752 Shares of OceanFirst Financial Corp. (NASDAQ:OCFC)

Citigroup Inc. Buys Shares of 27,982 Global Partners LP (NYSE:GLP)

ProShares UltraPro QQQ (NASDAQ:TQQQ) Shares Sold by Citigroup Inc.

ConturaEnergyInc . (NASDAQ:CTRA) Shares Purchased by Citigroup Inc.

Thomson Reuters Corp (NYSE:TRI) Shares Sold by Bank of America Corp DE

Moody’s Co. (NYSE:MCO) Shares Sold by Bank of America Corp DE

Bank of America Corp DE Acquires 9,915 Shares of M&T Bank Co. (NYSE:MTB)

Bank of America Corp DE Buys 3,470,777 Shares of Luckin Coffee Inc. (NYSE:LK)

Bank of America Corp DE Purchases 41,768 Shares of Schwab US Small-Cap ETF (NYSEARCA:SCHA)

Ancora Advisors LLC Acquires 1,340 Shares of Citigroup Inc (NYSE:C)

So we can refine our query by adding Boolean operators to our keyword search. However, this can become more complicated if we want to cast our net wider. For instance, let's say we want to pull stories about the banking sector in general. Rather than writing a complicated keyword search, we can search by category.

AYLIEN's NLP enrichment classifies stories into categories to allow us to make more powerful searches. Our classifier can classify content into two taxonomies, where each code corresponds to a subject. Learn more here.

Here, we will search for all stories classified as "banking" (04006002) using the IPTC subject taxonomy. You can search for other IPTC codes here.

In [7]:
# define the query parameters
params = {
  'language': ['en'],
  'published_at_start':'2020-03-22T00:00:00Z',
  'published_at_end':'2020-03-23T00:00:00Z',
  'categories_taxonomy': 'iptc-subjectcode',
  'categories_id': ['04006002'],
  'cursor': '*',
  'per_page' : 50
}

print(params)

stories = fetch_new_stories(params)

print('************')
print("Fetched %s stories" %(len(stories)))
{'language': ['en'], 'published_at_start': '2020-03-22T00:00:00Z', 'published_at_end': '2020-03-23T00:00:00Z', 'categories_taxonomy': 'iptc-subjectcode', 'categories_id': ['04006002'], 'cursor': '*', 'per_page': 50}
Fetched 50 stories. Total story count so far: 50
Fetched 50 stories. Total story count so far: 100
Fetched 50 stories. Total story count so far: 150
Fetched 50 stories. Total story count so far: 200
Fetched 50 stories. Total story count so far: 250
Fetched 50 stories. Total story count so far: 300
Fetched 50 stories. Total story count so far: 350
Fetched 50 stories. Total story count so far: 400
Fetched 50 stories. Total story count so far: 450
Fetched 50 stories. Total story count so far: 500
Fetched 50 stories. Total story count so far: 550
Fetched 50 stories. Total story count so far: 600
Fetched 50 stories. Total story count so far: 650
Fetched 50 stories. Total story count so far: 700
Fetched 50 stories. Total story count so far: 750
Fetched 11 stories. Total story count so far: 761
Fetched 0 stories. Total story count so far: 761
************
Fetched 761 stories

We can see that far more stories were returned this time — clearly there were lots of stories classified as "banking" in this timeframe.

Similarly, we may be interested in searching for certain recurring subjects appearing in the news: for example, banks, companies, dogs or even aliens! We could do this using keyword search, but AYLIEN provides a solution to this problem by classifying some words as "entities".

What is an entity? The Oxford English Dictionary provides a basic starting point of what an entity is, with its definition being "a thing with distinct and independent existence". Learn more about searching for entities here.

Returning to our query that pulled stories classified as "banking", let's pull all articles categorised as banking that also feature a "Company" or "Bank" entity in the title:

In [8]:
# define the query parameters
params = {
  'language': ['en'],
  'published_at_start':'2020-03-22T00:00:00Z',
  'published_at_end':'2020-03-23T00:00:00Z',
  'categories_taxonomy': 'iptc-subjectcode',
  'categories_id': ['04006002'],
  'entities_title_type': ['Bank', 'Company'],
  'cursor': '*',
  'per_page' : 50
}

print(params)

stories = fetch_new_stories(params)

print('************')
print("Fetched %s stories" %(len(stories)))
{'language': ['en'], 'published_at_start': '2020-03-22T00:00:00Z', 'published_at_end': '2020-03-23T00:00:00Z', 'categories_taxonomy': 'iptc-subjectcode', 'categories_id': ['04006002'], 'entities_title_type': ['Bank', 'Company'], 'cursor': '*', 'per_page': 50}
Fetched 50 stories. Total story count so far: 50
Fetched 50 stories. Total story count so far: 100
Fetched 50 stories. Total story count so far: 150
Fetched 50 stories. Total story count so far: 200
Fetched 50 stories. Total story count so far: 250
Fetched 50 stories. Total story count so far: 300
Fetched 50 stories. Total story count so far: 350
Fetched 32 stories. Total story count so far: 382
Fetched 0 stories. Total story count so far: 382
************
Fetched 382 stories

So we've returned fewer stories than before because not all of the titles included an entity type we specified. Let's look closely at the very first story we pulled on Citigroup and see if the title included any entities.

In [9]:
# define the query parameters
params = {
  'id': [71142615],
  'cursor': '*',
  'per_page' : 50
}

stories = fetch_new_stories(params)

print('************')
print("Fetched %s stories" %(len(stories)))
print('************')

# print the entities identified in the story title
for story in stories:
    print(story['title'])
    print('')
    for entity in story['entities']['title']:
        print('Text: ' + entity['text'])
        print('Entity types: ' + str(entity['types']))
        print('Entity links: ' + str(entity['links']))
        print('')
Fetched 1 stories. Total story count so far: 1
Fetched 0 stories. Total story count so far: 1
************
Fetched 1 stories
************
Citigroup Downgrades AMC Entertainment (NYSE:AMC) to Sell

Text: NYSE
Entity types: ['Exchange', 'ArchitecturalStructure', 'Building', 'Location', 'Place', 'Company', 'Organisation']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/New_York_Stock_Exchange'}

Text: AMC
Entity types: ['Manufacturer', 'Organisation', 'Agent', 'Company']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/Volkswagen'}

Text: AMC Entertainment
Entity types: ['Chain', 'Organisation', 'Agent', 'Company']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/AMC_Theatres'}

Text: Citigroup
Entity types: ['Organisation', 'Bank', 'Agent', 'Company']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/Citigroup'}

Text: Citigroup Downgrades AMC Entertainment
Entity types: ['Organisation']
Entity links: None

Cool! We can see that the classifier picked up some entities. We can also see that some of the entities are linked to a DBpedia entry — we will return to this below.

We are not limited to working with entities in the title, however; we can also search for entities in the body of the article. Let's print the first 10 entities in the body. We can see that AYLIEN's enrichment process identifies a whole range of entity types.

In [10]:
for story in stories:
    for entity in story['entities']['body'][0:10]:
        print('Text: ' + entity['text'])
        print('Entity types: ' + str(entity['types']))
        print('Entity links: ' + str(entity['links']))
        print('')
Text: ratings
Entity types: ['Systems', 'Place']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/Nielsen_ratings'}

Text: market capitalization
Entity types: ['Value', 'Company']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/Market_capitalization'}

Text: Barrington
Entity types: ['Location', 'PopulatedPlace', 'Place', 'Settlement', 'Village']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/Barrington,_Illinois'}

Text: AMC Entertainment
Entity types: ['Chain', 'Organisation', 'Agent', 'Company']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/AMC_Theatres'}

Text: PE ratio
Entity types: ['Ratio']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/Price–earnings_ratio'}

Text: United States
Entity types: ['Location', 'Country', 'Person', 'Republic', 'PopulatedPlace', 'Place']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/United_States'}

Text: The firm
Entity types: ['Film', 'Product', 'Wikidata:Q11424', 'Work']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/The_Firm_(1993_film)'}

Text: moving average
Entity types: ['Calculation']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/Moving_average'}

Text: Geode
Entity types: ['Structures', 'Building']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/Geode'}

Text: NYSE
Entity types: ['Exchange', 'ArchitecturalStructure', 'Building', 'Location', 'Place', 'Company', 'Organisation']
Entity links: {'dbpedia': 'http://dbpedia.org/resource/New_York_Stock_Exchange'}

We have seen how AYLIEN's NLP enrichment identifies entities and that some entities are tagged with a DBpedia link. Entities can be useful when a keyword or search term can refer to multiple entities. For example, imagine we are interested in finding news about the company Apple: how do we restrict our search to the company and ignore stories about the fruit? We could search for the keyword "Apple" and also filter on company entity types as described above, but we would then risk returning titles that include companies other than Apple Inc. while mentioning the fruit. We can, however, perform a more specific search using DBpedia entries.

DBpedia is a semantic web project that extracts structured information from Wikipedia, where each distinct entity is referred to by a URI (like http://dbpedia.org/resource/Apple_Inc.). Using these URIs, we can perform very specific searches for topics and reduce the ambiguity in our query.
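
To make the Apple example concrete, here is a minimal sketch of such a query. It reuses the entities_body_links_dbpedia parameter that the Citigroup demonstration below relies on; the date range is purely illustrative:

# a minimal sketch: restrict a search to Apple Inc. (the company, not the fruit)
# by filtering on its DBpedia URI instead of the ambiguous keyword "Apple"
params = {
  'language': ['en'],
  'published_at_start': '2020-03-22T00:00:00Z',  # illustrative date range
  'published_at_end': '2020-03-23T00:00:00Z',
  'entities_body_links_dbpedia': ['http://dbpedia.org/resource/Apple_Inc.'],
  'cursor': '*',
  'per_page': 50
}

apple_stories = fetch_new_stories(params)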

Below, we'll demonstrate a search for Citigroup using its DBpedia URI.

In [11]:
# define the query parameters
params = {
  'language': ['en'],
  'published_at_start':'2020-03-22T00:00:00Z',
  'published_at_end':'2020-03-23T00:00:00Z',
  'entities_body_links_dbpedia': ['http://dbpedia.org/resource/Citigroup'],
  'cursor': '*',
  'per_page' : 50
}

stories = fetch_new_stories(params)

print('************')
print("Fetched %s stories" %(len(stories)))
Fetched 50 stories. Total story count so far: 50
Fetched 50 stories. Total story count so far: 100
Fetched 50 stories. Total story count so far: 150
Fetched 50 stories. Total story count so far: 200
Fetched 50 stories. Total story count so far: 250
Fetched 50 stories. Total story count so far: 300
Fetched 50 stories. Total story count so far: 350
Fetched 50 stories. Total story count so far: 400
Fetched 50 stories. Total story count so far: 450
Fetched 50 stories. Total story count so far: 500
Fetched 50 stories. Total story count so far: 550
Fetched 50 stories. Total story count so far: 600
Fetched 50 stories. Total story count so far: 650
Fetched 50 stories. Total story count so far: 700
Fetched 50 stories. Total story count so far: 750
Fetched 50 stories. Total story count so far: 800
Fetched 50 stories. Total story count so far: 850
Fetched 50 stories. Total story count so far: 900
Fetched 50 stories. Total story count so far: 950
Fetched 50 stories. Total story count so far: 1000
Fetched 50 stories. Total story count so far: 1050
Fetched 50 stories. Total story count so far: 1100
Fetched 30 stories. Total story count so far: 1130
Fetched 0 stories. Total story count so far: 1130
************
Fetched 1130 stories

Below, we'll iterate through the entities in the body of the first story in the output to see how they are linked to DBpedia entries.

In [12]:
# loop through entities in the body and print the text and DBpedia link
for entity in stories[0]['entities']['body']:
    print(entity['text'])
    print(entity['links'])
Modern Portfolio Theory
{'dbpedia': 'http://dbpedia.org/resource/Modern_portfolio_theory'}
MPT
{'dbpedia': 'http://dbpedia.org/resource/Modern_portfolio_theory'}
asset allocation
{'dbpedia': 'http://dbpedia.org/resource/Asset_allocation'}
algorithm
{'dbpedia': 'http://dbpedia.org/resource/Algorithm'}
iShares
{'dbpedia': 'http://dbpedia.org/resource/IShares'}
oil
{'dbpedia': 'http://dbpedia.org/resource/Petroleum'}
SPDR
{'dbpedia': 'http://dbpedia.org/resource/SPDR'}
Wall Street
{'dbpedia': 'http://dbpedia.org/resource/Wall_Street'}
USO
{'dbpedia': 'http://dbpedia.org/resource/United_Service_Organizations'}
asset
{'dbpedia': 'http://dbpedia.org/resource/Asset'}
indicative
{'dbpedia': 'http://dbpedia.org/resource/Realis_mood'}
Facebook
{'dbpedia': 'http://dbpedia.org/resource/Facebook'}
epidemic
{'dbpedia': 'http://dbpedia.org/resource/Epidemic'}
Bitcoin
{'dbpedia': 'http://dbpedia.org/resource/Bitcoin'}
Invesco
{'dbpedia': 'http://dbpedia.org/resource/Invesco'}
asset classes
{'dbpedia': 'http://dbpedia.org/resource/Asset_classes'}
Artificial Intelligence
{'dbpedia': 'http://dbpedia.org/resource/Artificial_intelligence'}
artificial intelligence
{'dbpedia': 'http://dbpedia.org/resource/Artificial_intelligence'}
AI
{'dbpedia': 'http://dbpedia.org/resource/Artificial_intelligence'}
USD$
{'dbpedia': 'http://dbpedia.org/resource/United_States_dollar'}
US Dollar
{'dbpedia': 'http://dbpedia.org/resource/United_States_dollar'}
HYG
{'dbpedia': 'http://dbpedia.org/resource/Asteroid_family'}
clustering
{'dbpedia': 'http://dbpedia.org/resource/Cluster_analysis'}
volatility
{'dbpedia': 'http://dbpedia.org/resource/Volatility_(finance)'}
centroid
{'dbpedia': 'http://dbpedia.org/resource/Centroid'}
non-linear
{'dbpedia': 'http://dbpedia.org/resource/Nonlinear_system'}
Citigroup
{'dbpedia': 'http://dbpedia.org/resource/Citigroup'}
coronavirus
{'dbpedia': 'http://dbpedia.org/resource/Coronavirus'}
Coronavirus
{'dbpedia': 'http://dbpedia.org/resource/Coronavirus'}
Apple
{'dbpedia': 'http://dbpedia.org/resource/Apple_Inc.'}
United States Oil Fund
None
US
None
Citigroup (C
None
EFA
None
Invesco DB
None
Built
None
Conventional
None
Notably
None
ETF Trust
None
SPY
None
IWM
None
Treasury
None
TLT)
 
 iShares
None
BTC/USD$ Cross
None
UUP
None
TLT
None
ETF
None
Volatility
None
Apple (AAPL
None
Correlations
None
Correlation Analysis
None
Affinity Propagation
None
Affinity
None
K-Means
None
BTC
None

Non-English Content

So far we have pulled stories in English only. However, our News API supports 6 native languages and 10 translated languages:

Native Languages:

  • English (en)
  • German (de)
  • French (fr)
  • Italian (it)
  • Spanish (es)
  • Portuguese (pt)

Translated Languages:

  • Arabic (ar)
  • Danish (da)
  • Finnish (fi)
  • Dutch (nl)
  • Norwegian (no)
  • Russian (ru)
  • Swedish (sv)
  • Turkish (tr)
  • Chinese (simplified) (zh-cn)
  • Chinese (traditional) (zh-tw)

Let's perform a search in some native languages other than English. Here we'll search for stories featuring Citigroup in the title and print both the native-language title and its English translation.

In [13]:
# define the query parameters
params = {
  'language': ['de', 'fr', 'it', 'es', 'pt'],
  'title': 'Citigroup',
  'published_at_start':'2020-03-11T00:00:00Z',
  'published_at_end':'2020-03-12T00:00:00Z',
  'cursor': '*',
  'per_page' : 50
}


print(params)

stories = fetch_new_stories(params)

print('************')
print("Fetched %s stories" %(len(stories)))

for story in stories:
    print(story['title'])
    print(story['translations']['en']['title'])
    print('')
{'language': ['de', 'fr', 'it', 'es', 'pt'], 'title': 'Citigroup', 'published_at_start': '2020-03-11T00:00:00Z', 'published_at_end': '2020-03-12T00:00:00Z', 'cursor': '*', 'per_page': 50}
Fetched 5 stories. Total story count so far: 5
Fetched 0 stories. Total story count so far: 5
************
Fetched 5 stories
Citigroup mit deutlichen Kursverlusten von 4 Prozent
Citigroup with a 4% drop in course

JPM und Citigroup sollen durch Virus-Panik mehr Erträge zufallen
JPM and Citigroup to receive more revenue through virus panic

Investmentbanken: JPM und Citigroup sollen durch Virus-Panik ein halbe Milliarde Dollar mehr Erträge zufallen
Investment banks: JPM and Citigroup to pay more than $500 billion in revenue through virus panic

Investmentbanken: JPM und Citigroup sollen durch Virus-Panik ein halbe Milliarde Dollar mehr Erträge zufallen
Investment banks: JPM and Citigroup to pay more than $500 billion in revenue through virus panic

Citigroup-Aktie Aktuell - Citigroup mit Kursgewinnen
Citigroup Action Current - Citigroup with Rate winnings

Create a Pandas Dataframe From a List of Story Dictionaries

Up to now we have interrogated our News API output by converting the JSON objects to Python dictionaries, iterating through them and printing the elements, but sometimes we may wish to view the data in a more tabular format. Below, we will loop through our non-English content stories and create a Pandas dataframe. This will also be useful later when we want to visualize our data.

We'll also pull out some contextual information about each story, such as the article's permalink and the story's sentiment score. AYLIEN's enrichment process predicts the overall sentiment of the body and title of a document as positive, negative or neutral, and also outputs a confidence score.

In [14]:
# create dataframe in the format we want
my_columns = ['id', 'title', 'title_eng', 'permalink', 'published_at', 'source', 'body_polarity', 'body_polarity_score']
my_data_frame = pd.DataFrame(columns = my_columns)

for story in stories:
    
    data = [[
                story['id']
                , story['title']
                , story['translations']['en']['title']
                , story['links']['permalink']
                , story['published_at']
                , story['source']['domain']
                , story['sentiment']['body']['polarity']
                , story['sentiment']['body']['score']
            ]]
    
    data = pd.DataFrame(data, columns = my_columns)
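    # note: DataFrame.append was removed in pandas 2.0; on newer versions,
    # replace the next line with: my_data_frame = pd.concat([my_data_frame, data])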
    my_data_frame = my_data_frame.append(data)

my_data_frame
Out[14]:
id title title_eng permalink published_at source body_polarity body_polarity_score
0 68618890 Citigroup mit deutlichen Kursverlusten von 4 P... Citigroup with a 4% drop in course https://www.focus.de/finanzen/boerse/aktien/ci... 2020-03-11 23:04:41+00:00 focus.de positive 0.521656
0 68486107 JPM und Citigroup sollen durch Virus-Panik meh... JPM and Citigroup to receive more revenue thro... https://www.wiwo.de/investmentbanken-jpm-und-c... 2020-03-11 12:33:22+00:00 wiwo.de negative 0.626843
0 68462796 Investmentbanken: JPM und Citigroup sollen dur... Investment banks: JPM and Citigroup to pay mor... https://www.wiwo.de/investmentbanken-jpm-und-c... 2020-03-11 11:03:59+00:00 wiwo.de negative 0.626843
0 68459400 Investmentbanken: JPM und Citigroup sollen dur... Investment banks: JPM and Citigroup to pay mor... https://www.handelsblatt.com/finanzen/banken-v... 2020-03-11 10:47:19+00:00 handelsblatt.com neutral 0.629335
0 68407895 Citigroup-Aktie Aktuell - Citigroup mit Kursge... Citigroup Action Current - Citigroup with Rate... https://www.focus.de/finanzen/boerse/aktien/ci... 2020-03-11 05:45:01+00:00 focus.de neutral 0.631298

The Timeseries Endpoint

Pull Timeseries

We have seen how we can pull granular stories using the Stories endpoint. However, if we want to investigate volumes of stories over time, we can use the Timeseries endpoint. This endpoint retrieves the stories that meet our criteria and aggregates their volume per minute, hour, day, month, or however we see fit. This can be very useful for identifying spikes or dips in news volume relating to a subject of interest. By default, our query below will aggregate the volume of stories per day.
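
The aggregation interval is controlled by the period parameter. The '+1DAY' value in the response below is the default; assuming other intervals follow the same format, an hourly aggregation might look like this sketch:

# a minimal sketch, assuming the period parameter accepts interval strings
# in the same '+1DAY'-style format that appears in the response below
hourly_params = {
  'title': 'Citigroup',
  'published_at_start': '2020-03-01T00:00:00Z',
  'published_at_end': '2020-03-02T00:00:00Z',
  'period': '+1HOUR'  # aggregate story counts per hour instead of per day
}

hourly_response = api_instance.list_time_series(**hourly_params)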

In [15]:
# define the query parameters
params = {
  'title': 'Citigroup',
  'published_at_start':'2020-03-01T00:00:00Z',
  'published_at_end':'2020-04-01T00:00:00Z'
}

try:
    # List time series
    api_response = api_instance.list_time_series(**params)
except ApiException as e:
    print("Exception when calling DefaultApi->list_time_series: %s\n" % e)

pprint(api_response)
{'period': '+1DAY',
 'published_at_end': datetime.datetime(2020, 4, 1, 0, 0, tzinfo=tzutc()),
 'published_at_start': datetime.datetime(2020, 3, 1, 0, 0, tzinfo=tzutc()),
 'time_series': [{'count': 15,
                  'published_at': datetime.datetime(2020, 3, 1, 0, 0, tzinfo=tzutc())},
                 {'count': 60,
                  'published_at': datetime.datetime(2020, 3, 2, 0, 0, tzinfo=tzutc())},
                 {'count': 22,
                  'published_at': datetime.datetime(2020, 3, 3, 0, 0, tzinfo=tzutc())},
                 {'count': 38,
                  'published_at': datetime.datetime(2020, 3, 4, 0, 0, tzinfo=tzutc())},
                 {'count': 16,
                  'published_at': datetime.datetime(2020, 3, 5, 0, 0, tzinfo=tzutc())},
                 {'count': 42,
                  'published_at': datetime.datetime(2020, 3, 6, 0, 0, tzinfo=tzutc())},
                 {'count': 76,
                  'published_at': datetime.datetime(2020, 3, 7, 0, 0, tzinfo=tzutc())},
                 {'count': 131,
                  'published_at': datetime.datetime(2020, 3, 8, 0, 0, tzinfo=tzutc())},
                 {'count': 117,
                  'published_at': datetime.datetime(2020, 3, 9, 0, 0, tzinfo=tzutc())},
                 {'count': 98,
                  'published_at': datetime.datetime(2020, 3, 10, 0, 0, tzinfo=tzutc())},
                 {'count': 106,
                  'published_at': datetime.datetime(2020, 3, 11, 0, 0, tzinfo=tzutc())},
                 {'count': 98,
                  'published_at': datetime.datetime(2020, 3, 12, 0, 0, tzinfo=tzutc())},
                 {'count': 125,
                  'published_at': datetime.datetime(2020, 3, 13, 0, 0, tzinfo=tzutc())},
                 {'count': 23,
                  'published_at': datetime.datetime(2020, 3, 14, 0, 0, tzinfo=tzutc())},
                 {'count': 38,
                  'published_at': datetime.datetime(2020, 3, 15, 0, 0, tzinfo=tzutc())},
                 {'count': 69,
                  'published_at': datetime.datetime(2020, 3, 16, 0, 0, tzinfo=tzutc())},
                 {'count': 56,
                  'published_at': datetime.datetime(2020, 3, 17, 0, 0, tzinfo=tzutc())},
                 {'count': 25,
                  'published_at': datetime.datetime(2020, 3, 18, 0, 0, tzinfo=tzutc())},
                 {'count': 68,
                  'published_at': datetime.datetime(2020, 3, 19, 0, 0, tzinfo=tzutc())},
                 {'count': 70,
                  'published_at': datetime.datetime(2020, 3, 20, 0, 0, tzinfo=tzutc())},
                 {'count': 77,
                  'published_at': datetime.datetime(2020, 3, 21, 0, 0, tzinfo=tzutc())},
                 {'count': 76,
                  'published_at': datetime.datetime(2020, 3, 22, 0, 0, tzinfo=tzutc())},
                 {'count': 70,
                  'published_at': datetime.datetime(2020, 3, 23, 0, 0, tzinfo=tzutc())},
                 {'count': 114,
                  'published_at': datetime.datetime(2020, 3, 24, 0, 0, tzinfo=tzutc())},
                 {'count': 124,
                  'published_at': datetime.datetime(2020, 3, 25, 0, 0, tzinfo=tzutc())},
                 {'count': 45,
                  'published_at': datetime.datetime(2020, 3, 26, 0, 0, tzinfo=tzutc())},
                 {'count': 192,
                  'published_at': datetime.datetime(2020, 3, 27, 0, 0, tzinfo=tzutc())},
                 {'count': 16,
                  'published_at': datetime.datetime(2020, 3, 28, 0, 0, tzinfo=tzutc())},
                 {'count': 151,
                  'published_at': datetime.datetime(2020, 3, 29, 0, 0, tzinfo=tzutc())},
                 {'count': 70,
                  'published_at': datetime.datetime(2020, 3, 30, 0, 0, tzinfo=tzutc())},
                 {'count': 53,
                  'published_at': datetime.datetime(2020, 3, 31, 0, 0, tzinfo=tzutc())}]}
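The 'period': '+1DAY' field in the response confirms the default daily aggregation. To aggregate at a different granularity, we can pass a period parameter with the query. A minimal sketch (the exact period strings, such as '+1HOUR' for hourly buckets, follow the News API documentation):

# aggregate hourly instead of daily (illustrative; not run in this notebook)
hourly_params = dict(params, period='+1HOUR')
hourly_response = api_instance.list_time_series(**hourly_params)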

We can see that the timeseries object includes a time_series array that, by default, contains the number of stories published per day. We can convert this data to a Pandas dataframe to make it more legible:

In [16]:
#convert to dictionary
api_response = api_response.to_dict()
#convert to dataframe
timeseries_data = pd.DataFrame(api_response['time_series'])

# print the dataframe
timeseries_data
Out[16]:
count published_at
0 15 2020-03-01 00:00:00+00:00
1 60 2020-03-02 00:00:00+00:00
2 22 2020-03-03 00:00:00+00:00
3 38 2020-03-04 00:00:00+00:00
4 16 2020-03-05 00:00:00+00:00
5 42 2020-03-06 00:00:00+00:00
6 76 2020-03-07 00:00:00+00:00
7 131 2020-03-08 00:00:00+00:00
8 117 2020-03-09 00:00:00+00:00
9 98 2020-03-10 00:00:00+00:00
10 106 2020-03-11 00:00:00+00:00
11 98 2020-03-12 00:00:00+00:00
12 125 2020-03-13 00:00:00+00:00
13 23 2020-03-14 00:00:00+00:00
14 38 2020-03-15 00:00:00+00:00
15 69 2020-03-16 00:00:00+00:00
16 56 2020-03-17 00:00:00+00:00
17 25 2020-03-18 00:00:00+00:00
18 68 2020-03-19 00:00:00+00:00
19 70 2020-03-20 00:00:00+00:00
20 77 2020-03-21 00:00:00+00:00
21 76 2020-03-22 00:00:00+00:00
22 70 2020-03-23 00:00:00+00:00
23 114 2020-03-24 00:00:00+00:00
24 124 2020-03-25 00:00:00+00:00
25 45 2020-03-26 00:00:00+00:00
26 192 2020-03-27 00:00:00+00:00
27 16 2020-03-28 00:00:00+00:00
28 151 2020-03-29 00:00:00+00:00
29 70 2020-03-30 00:00:00+00:00
30 53 2020-03-31 00:00:00+00:00

Visualizing Timeseries

We can make sense of timeseries data much more quickly if we visualize it. Below, we make use of the Plotly library to visualize the data.

In [17]:
fig = go.Figure( data = go.Scatter( 
                                    x = timeseries_data['published_at']
                                    , y=timeseries_data['count']
                                    , line=dict(color='blue')
                                    ))
# format the chart
fig.update_layout(
    title='Volume of Stories Over Time',
    plot_bgcolor='white',
    xaxis=dict(
        gridcolor='rgb(204, 204, 204)',
        linecolor='rgb(204, 204, 204)'
        )
    , yaxis=dict(
        gridcolor='rgb(204, 204, 204)',
        linecolor='rgb(204, 204, 204)'
        )
)

fig.show()

Exploring Spikes in Timeseries Data

We can see from the graph that there are various spikes in news volume. We can explore the cause of these spikes by pulling, for each spike, the story from the most popular source, which gives us an indication of why Citigroup received so much attention. We measure a source's popularity using Alexa Ranking, an estimate of a site's popularity on the internet. Learn more about working with Alexa Ranking here.

Below, we'll identify the three dates with the most stories, then pull the highest-ranked story for each of those dates, reusing the same title parameter from our Timeseries query and sorting by Alexa rank.

In [18]:
# create dataframe to store the label data - note, the published_at and count fields are needed for x and y coords.
# the count field will be populated with the total count of stories for each respective day
my_columns = ['published_at', 'count', 'title']
label_data = pd.DataFrame(columns = my_columns)


# identify the dates with the most stories
top_3_dates = timeseries_data.sort_values(by=['count'], ascending = False)[0:3]

# rank stories by the Alexa rank of their source so we pull the most prominent story first
params['sort_by'] = 'source.rankings.alexa.rank.US'

for index, row in top_3_dates.iterrows():
    
    params['published_at_start'] = str(row['published_at'].date()) + 'T00:00:00Z'
    params['published_at_end'] = str(row['published_at'].date()) + 'T23:59:59Z'
    
    try:
        # List Stories
        response = api_instance.list_stories(**params)
    except ApiException as e:
        print("Exception when calling DefaultApi->list_stories: %s\n" % e)
        
    stories = response.stories
    convert_to_dict(stories)
    
    data = [[
                stories[0]['published_at'].date() # extract the date only, to align with the daily timeseries x-axis
                , row['count']
                , stories[0]['title']
            ]]
    
    data = pd.DataFrame(data, columns = my_columns)
    label_data = label_data.append(data, sort=True)
    
label_data
Out[18]:
count published_at title
0 192 2020-03-27 CarGurus (NASDAQ:CARG) Price Target Cut to $28...
0 151 2020-03-29 First Quantum Minerals (OTCMKTS:FQVLF) Rating ...
0 131 2020-03-08 Citigroup splits Buffalo, NY, trading staff as...

Add Labels to Timeseries Spikes

We will now append these titles to the spikes in the graph we previously created. If we hover over the markers, the tooltip will display the relevant story title.

In [19]:
trace_1 = go.Scatter( 
                        x = timeseries_data['published_at']
                        , y=timeseries_data['count']
                        , name = 'Volume of Stories'
                        , line=dict(color='blue')
                        )

trace_2 = go.Scatter(
        x = label_data['published_at']
        , y = label_data['count']
        , mode ='markers'
        , marker=dict(size=10,line=dict(width=2, color='blue'), color = 'white')
        , hovertemplate =  label_data['title']
        , name = 'News Title'
    )

data = [trace_1, trace_2]

fig = go.Figure(data=data)

# format the chart
fig.update_layout(
    title='Volume of Stories Over Time',
    legend = dict(orientation = 'h', y = -0.1),
    plot_bgcolor='white',
    xaxis=dict(
        gridcolor='rgb(204, 204, 204)',
        linecolor='rgb(204, 204, 204)'
        )
    , yaxis=dict(
        gridcolor='rgb(204, 204, 204)',
        linecolor='rgb(204, 204, 204)'
        )
)

fig.show()

Pull Timeseries by Sentiment

We can filter our timeseries queries in the same ways as stories queries, and one filter that is particularly interesting is sentiment. We have already discussed how stories are given a sentiment score at a granular level, and we can use this score to pull the volume of stories by sentiment polarity over time.

In the cell below, we query the Timeseries endpoint twice in a loop: once for positive-sentiment stories and once for negative-sentiment stories.

In [20]:
# define the query parameters
params = {
  'title': 'Citigroup',
  'published_at_start':'2020-03-01T00:00:00Z',
  'published_at_end':'2020-04-01T00:00:00Z'
}

polarities = [ 'positive', 'negative']

# Create dataframe to store the outputs
column_names = ["count", "published_at", "sentiment_body_polarity"]
timeseries_sentiment_data = pd.DataFrame(columns = column_names)

for polarity in polarities:

    print('===========================================')
    print('         ' + polarity + ' sentiment         ')
    print('===========================================')

    params['sentiment_body_polarity'] = polarity

    try:
        # List time series
        api_response = api_instance.list_time_series(**params)
        print('Completed API call')

    except ApiException as e:
        print("Exception when calling DefaultApi->list_time_series: %s\n" % e)

    # Convert TimeSeriesList to python Dictionary
    api_response = api_response.to_dict()
    #convert data to dataframe for visualization
    api_response = pd.DataFrame(api_response['time_series'])

    #add polarity indicator
    api_response['sentiment_body_polarity'] = polarity

    timeseries_sentiment_data = timeseries_sentiment_data.append(api_response)

    print("Completed")

timeseries_sentiment_data
===========================================
         positive sentiment         
===========================================
Completed API call
Completed
===========================================
         negative sentiment         
===========================================
Completed API call
Completed
Out[20]:
count published_at sentiment_body_polarity
0 2 2020-03-01 00:00:00+00:00 positive
1 10 2020-03-02 00:00:00+00:00 positive
2 5 2020-03-03 00:00:00+00:00 positive
3 11 2020-03-04 00:00:00+00:00 positive
4 5 2020-03-05 00:00:00+00:00 positive
... ... ... ...
26 83 2020-03-27 00:00:00+00:00 negative
27 4 2020-03-28 00:00:00+00:00 negative
28 71 2020-03-29 00:00:00+00:00 negative
29 13 2020-03-30 00:00:00+00:00 negative
30 11 2020-03-31 00:00:00+00:00 negative

62 rows × 3 columns
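Since both polarities now live in one long-format dataframe, it can be handy to pivot the data so that each polarity becomes its own column; this also makes it easy to derive a net-sentiment series. A minimal sketch (the net_sentiment column is our own derived metric, not a News API output):

# pivot so that each polarity is its own column
sentiment_wide = timeseries_sentiment_data.pivot_table(
    index='published_at',
    columns='sentiment_body_polarity',
    values='count',
    aggfunc='sum'
)

# net sentiment: positive minus negative story count per day
sentiment_wide['net_sentiment'] = sentiment_wide['positive'] - sentiment_wide['negative']
sentiment_wide.head()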

Visualizing Timeseries by Sentiment

In [21]:
colours = {
            'positive' : 'rgb(138, 190, 6)'
             , 'positive_opaque' : 'rgba(138, 190, 6, 0.05)'
       
             , 'negative' : 'rgb(228, 42, 58)'
             , 'negative_opaque' : 'rgba(228, 42, 58, 0.05)'
    
            , 'neutral' : 'rgb(40, 56, 78)'
            , 'neutral_opaque' : 'rgba(40, 56, 78, 0.05)'
            }


# we will plot two traces on the same axes
fig = make_subplots(rows=1, cols=1)

# loop over positive and negative sentiment data to generate two line graphs
# start of for loop =======================================================================================
for polarity in polarities:
     
    if polarity == 'negative':
        # multiply absolute number of stories by -1 to visualize negative sentiment stories
        factor = -1
    else:
        factor = 1

    # filter to the data we want to visualize based on sentiment    
    data = timeseries_sentiment_data[timeseries_sentiment_data.sentiment_body_polarity == polarity]

    fig.append_trace(go.Scatter(
        x = data['published_at']
        , y = data['count']*factor
        , mode = 'lines'
        , name = 'Vol. stories '+polarity
        , line = dict(color = colours[polarity], width=1)
        , fill = 'tozeroy'
        , fillcolor = colours[polarity + "_opaque"]
        , hovertemplate =  '<b>Date</b>: %{x}<br>'
                            +'<b>Stories</b>: %{y}'
    ) 
    , col = 1
    , row = 1)

# end of for loop =======================================================================================

# format the chart
fig.update_layout(
    title='Volume of Positive & Negative Sentiment Stories Over Time',
    legend = dict(orientation = 'h', y = -0.1),
    plot_bgcolor='white',
    xaxis=dict(
        gridcolor='rgb(204, 204, 204)',
        linecolor='rgb(204, 204, 204)'
        )
    , yaxis=dict(
        gridcolor='rgb(204, 204, 204)',
        linecolor='rgb(204, 204, 204)'
        )
)

fig.show()

The Trends Endpoint

Similar to the Timeseries endpoint, we may be interested in seeing themes and patterns over time that aren't immediately apparent when looking at individual stories. The Trends endpoint allows us to see the most frequently recurring entities, concepts or keywords that appear in articles that meet our search criteria.

Below, we will pull the most frequently occurring entities in the body of stories mentioning Citigroup over a one-month period.

Note: this query will take longer to run than the previous endpoint calls, as the News API performs analysis on all entities included in all the stories that meet our search criteria.
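The field argument controls which enrichment the endpoint aggregates over. A couple of alternative values, as a sketch (treat the exact strings as assumptions and check the News API documentation for the full list of supported fields):

# aggregate keywords instead of body entities
field = 'keywords'

# or aggregate entities appearing in story titles
field = 'entities.title.text'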

In [22]:
# define the query parameters
params = {
  'title': 'Citigroup',
  'published_at_start':'2020-03-01T00:00:00Z',
  'published_at_end':'2020-04-01T00:00:00Z'
}

field = 'entities.body.text'

print("Running...")

try:
    api_response = api_instance.list_trends(field, **params)
except ApiException as e:
    print("Exception when calling DefaultApi->list_time_series: %s\n" % e)

#convert to dictionary
trends_dictionary = api_response.to_dict()


print("Completed")
pprint(trends_dictionary)
Running...
Completed
{'field': 'entities.body.text',
 'trends': [{'count': 1970, 'value': 'ratings'},
            {'count': 1942, 'value': 'MarketBeat.com'},
            {'count': 1394, 'value': 'hedge funds'},
            {'count': 1387, 'value': 'Receive News & Ratings'},
            {'count': 1374, 'value': 'Citigroup'},
            {'count': 1357, 'value': 'The firm'},
            {'count': 1326, 'value': 'institutional investors'},
            {'count': 1243, 'value': 'stake'},
            {'count': 1137, 'value': 'Citigroup Inc'},
            {'count': 1120, 'value': 'moving average'},
            {'count': 1012, 'value': 'NYSE'},
            {'count': 971, 'value': 'Citigroup Inc.'},
            {'count': 939, 'value': 'SEC'},
            {'count': 857, 'value': 'Shares'},
            {'count': 849, 'value': 'equity'},
            {'count': 805, 'value': 'Zacks Investment Research'},
            {'count': 789, 'value': 'EPS'},
            {'count': 729, 'value': 'market capitalization'},
            {'count': 710, 'value': 'ValuEngine'},
            {'count': 672, 'value': 'market cap'},
            {'count': 653, 'value': 'HoldingsChannel.com'},
            {'count': 648, 'value': 'dividend'},
            {'count': 629, 'value': 'NASDAQ'},
            {'count': 608, 'value': 'United States'},
            {'count': 567, 'value': 'simple moving average'},
            {'count': 560, 'value': 'Want'},
            {'count': 543, 'value': 'LLC'},
            {'count': 536, 'value': 'Securities & Exchange Commission'},
            {'count': 513, 'value': 'overweight'},
            {'count': 479, 'value': 'P/E ratio'},
            {'count': 467, 'value': 'Securities and Exchange Commission'},
            {'count': 466, 'value': 'price-to-earnings ratio'},
            {'count': 459, 'value': 'PE'},
            {'count': 457, 'value': 'PE ratio'},
            {'count': 369, 'value': 'fiscal year'},
            {'count': 367, 'value': 'institutional investor'},
            {'count': 359, 'value': 'BidaskClub'},
            {'count': 342, 'value': 'Morgan Stanley'},
            {'count': 331, 'value': 'The Fly'},
            {'count': 325, 'value': 'Europe'},
            {'count': 320, 'value': 'Fly'},
            {'count': 318, 'value': 'financial services'},
            {'count': 312, 'value': 'Inc'},
            {'count': 295, 'value': 'dividend yield'},
            {'count': 294, 'value': 'Barclays'},
            {'count': 290, 'value': 'Asia'},
            {'count': 289, 'value': 'Canada'},
            {'count': 285, 'value': 'Wells Fargo'},
            {'count': 281, 'value': 'JPMorgan Chase'},
            {'count': 270, 'value': 'North America'},
            {'count': 257, 'value': 'Goldman Sachs'},
            {'count': 254, 'value': 'Wells Fargo & Co'},
            {'count': 254, 'value': 'Zacks'},
            {'count': 251, 'value': 'JPMorgan Chase & Co'},
            {'count': 249, 'value': 'Hedge'},
            {'count': 249, 'value': 'P/E'},
            {'count': 244, 'value': 'JPMorgan Chase & Co.'},
            {'count': 236, 'value': 'SEC filing'},
            {'count': 229, 'value': 'PEG'},
            {'count': 228, 'value': 'Deutsche Bank'},
            {'count': 227, 'value': 'Form'},
            {'count': 221, 'value': 'Credit Suisse Group'},
            {'count': 221, 'value': 'holding company'},
            {'count': 217, 'value': 'State Street Corp'},
            {'count': 215, 'value': 'TheStreet'},
            {'count': 212, 'value': 'Middle East'},
            {'count': 208, 'value': '52-week'},
            {'count': 206, 'value': 'Goldman Sachs Group'},
            {'count': 204, 'value': 'Africa'},
            {'count': 204, 'value': 'hyperlink'},
            {'count': 203, 'value': 'CEO'},
            {'count': 202, 'value': 'Bank of America'},
            {'count': 190, 'value': 'DPR'},
            {'count': 190, 'value': '“Buy'},
            {'count': 189, 'value': 'Royal Bank of Canada'},
            {'count': 186, 'value': 'Thomson Reuters'},
            {'count': 184, 'value': '12-month'},
            {'count': 180, 'value': '1-year'},
            {'count': 175, 'value': 'Reading'},
            {'count': 174, 'value': 'Latin America'},
            {'count': 169, 'value': 'investment analyst'},
            {'count': 168, 'value': 'Great West Life Assurance Co'},
            {'count': 165, 'value': 'Buy'},
            {'count': 164, 'value': 'bank'},
            {'count': 163, 'value': 'PLC'},
            {'count': 160, 'value': 'Read More'},
            {'count': 151, 'value': 'Global Consumer Banking'},
            {'count': 151, 'value': 'UBS'},
            {'count': 150, 'value': 'Citigroup Daily - Enter'},
            {'count': 150, 'value': 'GCB'},
            {'count': 150, 'value': 'ICG'},
            {'count': 150, 'value': 'Institutional Clients Group'},
            {'count': 149, 'value': 'California'},
            {'count': 146, 'value': 'FMR'},
            {'count': 144, 'value': 'Featured Article'},
            {'count': 141, 'value': 'BMO Capital Markets'},
            {'count': 140, 'value': 'Montreal'},
            {'count': 139, 'value': 'UBS Group'},
            {'count': 137, 'value': 'Bank of Montreal'},
            {'count': 133, 'value': 'Featured Story'}]}

We can visualize the output of the Trends endpoint as a wordcloud to help us quickly interpret the most prevalent keywords.

In [32]:
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt

#convert data to dataframe for visualization
trends_data = pd.DataFrame(trends_dictionary['trends'])

subset = trends_data[['value', 'count']]
tuples = [tuple(x) for x in subset.values]

# Custom Colormap
from matplotlib.colors import ListedColormap # use when indexing directly into a colour map

word_colours = [ 
            "#495B70"   # aylien navy
            , "#8BBE07" # aylien green
            , "#7A98B7" # grey
            , "#E77C05" # orange
            , "#0796BE" # blue
            , "#162542" # dark grey
        ]

# listed colour map
cmap = ListedColormap(word_colours)

wordcloud = WordCloud(background_color="white", width=800, height=400, colormap=cmap).generate_from_frequencies(dict(tuples))
plt.figure( figsize=(20,10) )
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

We have used a wordcloud to investigate the most prominent entities in a one-month period, but what if we want to investigate the frequency of mentions over time?

We can loop over the Trends endpoint and create a timeseries to investigate the distribution of entities over time.

First, we will create a function that builds a list of tuples containing daily intervals, allowing us to search for trends day by day within a defined period.

In [24]:
# the time format we need to submit for News API queries
AYLIEN_TIME_FORMAT = '%Y-%m-%dT%H:%M:%SZ'

def to_date(date):
    if not isinstance(date, datetime):
        date = str2date(date)
    return date.strftime(AYLIEN_TIME_FORMAT)

def str2date(string):
    return datetime.strptime(string, '%Y-%m-%d')

def get_intervals(start_date, end_date):
    start_date = str2date(start_date)
    end_date = str2date(end_date)
    return [(to_date(start_date + timedelta(days=d)),
             to_date(start_date + timedelta(days=d + 1)))
            for d in range((end_date - start_date).days)]
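As a quick sanity check, here is what get_intervals produces for a three-day window:

# each tuple is a (start, end) pair spanning one day
get_intervals('2020-03-01', '2020-03-03')
# [('2020-03-01T00:00:00Z', '2020-03-02T00:00:00Z'),
#  ('2020-03-02T00:00:00Z', '2020-03-03T00:00:00Z')]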

Next, we will define our date range, create a list of date tuples and iterate over those daily intervals to populate a dataframe that relates each entity, the number of times it was mentioned and the day the mentions occurred.

In [25]:
# define our daily intervals
day_intervals = get_intervals('2020-03-01', '2020-03-31')

# create dataframe in the format we want
my_columns = ['count', 'value', 'published_at']
trends_data_frame = pd.DataFrame(columns = my_columns)

# define the query parameters
params = {
  'title': 'Citigroup'
}

# define what trends we want to return
field = 'entities.body.text'

for day in tqdm(day_intervals):
    
    # define time interval
    params['published_at_start'] = day[0]
    params['published_at_end'] = day[1]

    try:
        api_response = api_instance.list_trends(field, **params)
    except ApiException as e:
        print("Exception when calling DefaultApi->list_time_series: %s\n" % e)

    #convert to dictionary
    api_response = api_response.to_dict()
    
    #convert to dataframe
    api_response = pd.DataFrame(api_response['trends'])

    # add in a day label
    api_response['published_at'] = params['published_at_start']
    
    # add to global dataframe
    trends_data_frame = trends_data_frame.append(api_response)

print("Completed")
100%|██████████| 30/30 [06:00<00:00, 12.03s/it]
Completed

We can loop over this dataframe and visualize the distribution of the different entities. Note that the code below visualizes only the top ten entities.

In [26]:
# plot all traces on the same axes
fig = make_subplots(rows=1, cols=1)

# identify the top ten entities
entities_total = trends_data_frame.groupby(['value'])['count'].agg('sum').reset_index().sort_values(by=['count'], ascending = False)

top_ten_entities = entities_total[0:10]['value'].unique()

# loop over the top ten entities to generate a line graph for each
# start of for loop =======================================================================================
for entity in top_ten_entities:

    # filter to the data we want to visualize based on entity
    data = trends_data_frame[trends_data_frame['value'] == entity]

    fig.append_trace(go.Scatter(
        x = data['published_at']
        , y = data['count']
        , mode = 'lines'
        , name = entity
    ) 
    , col = 1
    , row = 1)

# end of for loop =======================================================================================

# format the chart
fig.update_layout(
    title='Trending Entities Over Time',
    legend = dict(orientation = 'h', y = -0.1),
    plot_bgcolor='white',
    xaxis=dict(
        gridcolor='rgb(204, 204, 204)',
        linecolor='rgb(204, 204, 204)'
        )
    , yaxis=dict(
        gridcolor='rgb(204, 204, 204)',
        linecolor='rgb(204, 204, 204)'
        )
    , height=700
)

fig.show()

The Clusters Endpoint

Naturally, multiple news stories will exist that report on the same or similar topics. AYLIEN's clustering enrichment groups together stories that typically correspond to real-world events or topics. Clusters are made up of stories that exist close to one another in vector space, and the clustering enrichment links each cluster to a "representative story" at its centre; reading this representative story provides an indication of the general nature of the entire cluster.

Similar to the Timeseries and Trends endpoints, clusters enable us to review stories over time and identify points of interest. We can search for individual clusters using a cluster ID but, as with stories, we will generally not know the IDs of interest before we find them. Consequently, we can search for clusters using the Trends endpoint, which allows us to filter clusters based on the stories they contain.

The Trends endpoint returns the IDs of clusters sorted by the count of stories associated with them. Once we have each cluster's ID, we can go on to get the stories for each cluster from the Stories endpoint. Note that the Trends endpoint only returns the top 100 clusters for a given query.

The following script uses the Trends endpoint to identify clusters of news that feature the Citigroup entity and returns the top 3 stories in each cluster, ranked by Alexa ranking.

In [27]:
# define the query parameters
params = {
  'published_at_start':'2020-01-01T00:00:00Z',
  'published_at_end':'2020-04-01T00:00:00Z',
  'entities_body_links_dbpedia': ['http://dbpedia.org/resource/Citigroup'],
  'field' : 'clusters'
}

def get_cluster_from_trends():

    """
    Returns a list of up to 100 clusters that meet the parameters set out.
    """
    response = api_instance.list_trends(**params)

    return [item.value for item in response.trends]


def get_cluster_metadata(cluster_id):

    """
    Returns the representative story, number of stories, and time value for a given cluster
    """

    response = api_instance.list_clusters(
        id=[cluster_id]
    )

    clusters = response.clusters

    if clusters is None or len(clusters) == 0:
        return None

    first_cluster = clusters[0]

    return {
        "cluster": first_cluster.id,
        "representative_story": first_cluster.representative_story,
        "story_count": first_cluster.story_count,
        "time": first_cluster.time
    }


def get_top_stories(cluster_id):
    """
    Returns 3 stories associated with the cluster from the highest-ranking publishers
    """
    
    response = api_instance.list_stories(
        clusters=[cluster_id],
        sort_by="source.rankings.alexa.rank.US",
        per_page=3
    )
    
    stories = response.stories
    convert_to_dict(stories) 

    return response.stories


cluster_ids = get_cluster_from_trends()
clusters_output = []

# loop through cluster IDs and print a progress bar
for cluster_id in tqdm(cluster_ids):
    metadata = get_cluster_metadata(cluster_id)

    if metadata is not None:
        # convert representative story object to python dictionary
        metadata['representative_story'] = metadata['representative_story'].to_dict()

        stories = get_top_stories(cluster_id)
        metadata["stories"] = stories
        clusters_output.append(metadata)
    else:
        print("{} empty".format(cluster_id))
    
    # sleep so that we don't exceed max hits per minute
    time.sleep(1)
        
print('Complete')
100%|██████████| 100/100 [03:14<00:00,  1.95s/it]
Complete

If we look at the first 3 clusters returned, we can see the number of stories associated with each cluster, the representative story title and the top 3 ranked stories.

In [28]:
for cluster in clusters_output[0:3]:
    print('Cluster ID: ' + str(cluster['cluster']))
    print('Story Count: ' + str(cluster['story_count']))
    print('Representative Story Title: ' + str(cluster['representative_story']['title']))
    print('Top ranked stories in cluster:')
    for story in cluster['stories']:
        indent_string = '   >  '
        print(indent_string + story['title'])
    
    print('')
Cluster ID: 102164680
Story Count: 685
Representative Story Title: T-Mobile Us Inc (NASDAQ:TMUS) Shares Sold by DNB Asset Management AS
Top ranked stories in cluster:
   >  Private Trust Co. NA Sells 154 Shares of S&P Global Inc (NYSE:SPGI)
   >  Automatic Data Processing (NASDAQ:ADP) Shares Acquired by Private Trust Co. NA
   >  Mastercard Inc (NYSE:MA) Stock Position Increased by Simon Quick Advisors LLC

Cluster ID: 108530637
Story Count: 1005
Representative Story Title: Cubist Systematic Strategies LLC Has $4.35 Million Holdings in American Tower Corp (NYSE:AMT)
Top ranked stories in cluster:
   >  ABIOMED, Inc. (NASDAQ:ABMD) Shares Purchased by Cubist Systematic Strategies LLC
   >  Cubist Systematic Strategies LLC Has $7.13 Million Stake in Corning Incorporated (NYSE:GLW)
   >  Cubist Systematic Strategies LLC Raises Stock Position in TechnipFMC PLC (NYSE:FTI)

Cluster ID: 102118099
Story Count: 177
Representative Story Title: JPMorgan Chase Earnings: JPM Stock 2% Higher on Strong Revenue Beat
Top ranked stories in cluster:
   >  $1.82 Earnings Per Share Expected for Citigroup Inc (NYSE:C) This Quarter
   >  JPMorgan Chase & Co. (NYSE:JPM) Releases Earnings Results, Beats Expectations By $0.25 EPS
   >  Analysts Expect Citigroup Inc (NYSE:C) Will Announce Earnings of $1.84 Per Share

Visualizing Cluster Data

We can visualize the cluster data to make it more digestible. Below, we'll convert it to a Pandas dataframe and then visualize it with Plotly.

In [29]:
# create dataframe in the format we want
my_columns = ['cluster_id', 'representative_story_title', 'representative_story_permalink', 'published_at', 'story_count']
clusters_data_frame = pd.DataFrame(columns = my_columns)

for cluster in clusters_output:
    
    data = [[
                cluster['cluster']
                , cluster['representative_story']['title']
                , cluster['representative_story']['permalink']
                , cluster['representative_story']['published_at']
                , cluster['story_count']
            ]]
    
    data = pd.DataFrame(data, columns = my_columns)
    clusters_data_frame = clusters_data_frame.append(data, sort=True)
    
clusters_data_frame['published_at'] = pd.to_datetime(clusters_data_frame['published_at'], utc = True)

pd.set_option('display.max_rows', 100)
clusters_data_frame = clusters_data_frame.sort_values(by=['story_count'], ascending = False).reset_index(0)

# convert story count to plotly friendly format
clusters_data_frame['story_count'] = clusters_data_frame['story_count'].astype(np.int64)

clusters_data_frame.head()
Out[29]:
index cluster_id published_at representative_story_permalink representative_story_title story_count
0 0 108530637 2020-03-26 13:33:37+00:00 https://www.dispatchtribunal.com/2020/03/26/cu... Cubist Systematic Strategies LLC Has $4.35 Mil... 1005
1 0 102173544 2020-01-06 15:25:13+00:00 https://federalnewsnetwork.com/government-news... Asian markets slide on alarm over Mideast tens... 868
2 0 108383763 2020-03-27 11:22:18+00:00 https://www.dispatchtribunal.com/2020/03/27/go... Goldman Sachs Group Inc. Sells 1,083,601 Share... 801
3 0 102164680 2020-01-13 17:47:29+00:00 https://www.com-unik.info/2020/01/13/t-mobile-... T-Mobile Us Inc (NASDAQ:TMUS) Shares Sold by D... 685
4 0 106657753 2020-03-08 10:17:00+00:00 https://www.dailypolitical.com/2020/03/08/lloy... Lloyds Banking Group PLC (NYSE:LYG) Shares Acq... 608

Split Title String into Substrings

Here we will add a break tag after every eighth word so that the title text fits neatly into tooltips in our graph.

In [30]:
# break title string into multiple lines so that it fits in a tooltip on our graph
title_strings = []

for index, row in clusters_data_frame.iterrows():
    word_array = row['representative_story_title'].split()
    counter = 0
    string = ''
    for word in word_array:
        if counter == 7:
            string += (word + '<br>')
            counter = 0
        else:
            string += (word + ' ')
            counter += 1
    title_strings.append(string)
    
clusters_data_frame['title_string'] = title_strings
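As an aside, if you would rather break on character width than on word count, Python's built-in textwrap module achieves a similar effect more concisely. A sketch (the width of 60 characters is an arbitrary choice):

import textwrap

# wrap each title at roughly 60 characters and join the chunks with <br> tags
clusters_data_frame['title_string'] = [
    '<br>'.join(textwrap.wrap(title, width=60))
    for title in clusters_data_frame['representative_story_title']
]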

Visualize the Clusters in a Scatterplot

In [31]:
fig = go.Figure(data=go.Scatter(
    x=clusters_data_frame['published_at'],
    y=clusters_data_frame['story_count'],
    mode='markers',
    marker=dict(
                size=clusters_data_frame['story_count']/10
                , line = dict(width=2, color = colours['neutral'])
                , color = colours['neutral' + '_opaque']
                ),
    hovertext = clusters_data_frame['title_string']
))

# format the chart
fig.update_layout(
    title='Story Clusters Over Time',
    legend = dict(orientation = 'h', y = -0.1),
    plot_bgcolor='white',
    xaxis=dict(
        gridcolor='rgb(204, 204, 204)',
        linecolor='rgb(204, 204, 204)'
        )
    , yaxis=dict(
        gridcolor='rgb(204, 204, 204)',
        linecolor='rgb(204, 204, 204)'
        )
    , height=700
)

fig.show()

Conclusion

Here we have given a quick introduction to getting up and running with four of the AYLIEN News API's most frequently used endpoints. With these code and visualization examples, you should be able to start exploring news data in no time!