Consume APIs in a Notebook¶
If you have less than 100 MB of data, you can fetch it all in a single call to the Query API or to an API endpoint, using parameters if you wish. With more data than that, you can't fetch everything in one go: you need to query it piece by piece, keeping each API call under 100 MB. The solution is to fetch batches using the Data Source's sorting key; filtering on the columns in the sorting key keeps each query fast.
In this example, the Data Source is sorted on the timestamp column, so we use batches of a fixed amount of time. In general, time is a good way to batch.
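Time-based batching can be sketched as a small generator of fixed windows. This is a minimal sketch, not part of any library: the function name `time_batches` and the window size are illustrative.

```python
from datetime import datetime, timedelta

def time_batches(start: datetime, end: datetime, minutes: int = 5):
    """Yield (batch_start, batch_end) pairs covering [start, end) in fixed windows.

    Illustrative helper: each pair can become a WHERE clause on the
    timestamp column in the sorting key, so every batch query stays fast.
    """
    step = timedelta(minutes=minutes)
    cursor = start
    while cursor < end:
        yield cursor, min(cursor + step, end)
        cursor += step

# One hour split into 5-minute batches -> 12 (start, end) pairs
windows = list(time_batches(datetime(2022, 1, 1, 0, 0), datetime(2022, 1, 1, 1, 0)))
```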
fetch_table_streaming_query and fetch_table_streaming_endpoint in the notebook work as generators. They should always be used in a for loop or as the input to another generator.
Process each batch as it arrives and discard any fetched data you don't need; fetch only the data your processing requires. The idea is not to recreate the Data Source in the notebook but to process each batch as it arrives, writing as little data as possible to your DataFrame.
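The consume-and-discard pattern looks like this. The `batches` generator below is a stand-in that yields synthetic DataFrames, playing the role of what the notebook's fetch functions yield per time batch; the column names are invented for the example.

```python
import pandas as pd

def batches():
    """Stand-in generator: yields one small DataFrame per time batch,
    as the notebook's fetch_table_streaming_* generators do."""
    for _ in range(2):
        yield pd.DataFrame({"server_name": ["en.wikipedia.org", "de.wikipedia.org"],
                            "edits": [10, 5]})

# Process each batch as it arrives; keep only the aggregated result,
# not the raw rows, so the final DataFrame stays small.
results = []
for batch in batches():
    summary = batch.groupby("server_name", as_index=False)["edits"].sum()
    results.append(summary)

final_df = pd.concat(results, ignore_index=True)
```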
Fetch data with the Query API¶
Here we use the requests library for Python. The SQL query pulls in an hour less of data than the full Data Source. A DataFrame is created from the text part of the response.
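A minimal sketch of that call, assuming a Tinybird-style Query API at `/v0/sql` with Bearer-token auth; the token placeholder and the `wiki` table name in the commented example are assumptions.

```python
from io import StringIO

import pandas as pd
import requests

TOKEN = "<YOUR_READ_TOKEN>"                   # assumption: a token with read access
QUERY_API = "https://api.tinybird.co/v0/sql"  # Query API URL

def csv_query(sql: str) -> str:
    # Ask for CSV with a header row so pandas can parse the response text directly
    return f"{sql.rstrip()} FORMAT CSVWithNames"

def fetch_query(sql: str) -> pd.DataFrame:
    r = requests.get(
        QUERY_API,
        params={"q": csv_query(sql)},
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    r.raise_for_status()
    # A DataFrame is created from the text part of the response
    return pd.read_csv(StringIO(r.text))

# Example (table name 'wiki' is illustrative): everything except the last hour
# df = fetch_query(
#     "SELECT * FROM wiki "
#     "WHERE timestamp < (SELECT max(timestamp) - INTERVAL 1 HOUR FROM wiki)"
# )
```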
Fetch data from an API Endpoint with Parameters¶
This endpoint node in the pipe endpoint_wiki selects from the Data Source within a range of dates, using parameters for the start and end of the range.
These parameters are passed in the call to the API endpoint to select only the data within the range. A DataFrame is created from the text part of the response.
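A sketch of the parameterized call, assuming the pipe is published as a CSV endpoint; the parameter names `start_date` and `end_date` are illustrative and must match the ones defined in the pipe.

```python
from io import StringIO

import pandas as pd
import requests

TOKEN = "<YOUR_READ_TOKEN>"  # assumption: a token that can read the endpoint
# Assumption: the pipe endpoint_wiki is published as a CSV endpoint
ENDPOINT = "https://api.tinybird.co/v0/pipes/endpoint_wiki.csv"

def range_params(start_date: str, end_date: str) -> dict:
    """Build the query-string parameters for the date range (names are illustrative)."""
    return {"start_date": start_date, "end_date": end_date}

def fetch_endpoint(start_date: str, end_date: str) -> pd.DataFrame:
    r = requests.get(ENDPOINT, params={"token": TOKEN, **range_params(start_date, end_date)})
    r.raise_for_status()
    # A DataFrame is created from the text part of the response
    return pd.read_csv(StringIO(r.text))

# df = fetch_endpoint("2022-01-01 00:00:00", "2022-01-01 23:00:00")
```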
Fetch batches of data using the Query API¶
fetch_table_streaming_query in the notebook accepts more complex queries than a date range. Here you choose what you filter and sort by. This example reads in batches of 5 minutes to create a small DataFrame, which should then be processed, with the results of the processing appended to the final DataFrame.
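That loop can be sketched as below, one Query API call per 5-minute batch. The token placeholder, the `wiki` table name, and the per-batch aggregation are assumptions for illustration; the real notebook function accepts the query directly.

```python
from datetime import datetime, timedelta
from io import StringIO

import pandas as pd
import requests

TOKEN = "<YOUR_READ_TOKEN>"                   # assumption: a read token
QUERY_API = "https://api.tinybird.co/v0/sql"  # Query API URL

def batch_sql(start: datetime, end: datetime) -> str:
    """Build the per-batch query (the 'wiki' table and aggregation are illustrative).
    Filtering on timestamp hits the sorting key, so each batch is fast."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        "SELECT server_name, count() AS edits FROM wiki "
        f"WHERE timestamp >= '{start.strftime(fmt)}' "
        f"AND timestamp < '{end.strftime(fmt)}' "
        "GROUP BY server_name FORMAT CSVWithNames"
    )

def run(start: datetime, end: datetime, minutes: int = 5) -> pd.DataFrame:
    results, cursor, step = [], start, timedelta(minutes=minutes)
    while cursor < end:  # one Query API call per batch
        r = requests.get(QUERY_API, params={"q": batch_sql(cursor, min(cursor + step, end))},
                         headers={"Authorization": f"Bearer {TOKEN}"})
        r.raise_for_status()
        batch = pd.read_csv(StringIO(r.text))
        results.append(batch)  # process each small batch, append only the result
        cursor += step
    return pd.concat(results, ignore_index=True)

# final_df = run(datetime(2022, 1, 1, 0, 0), datetime(2022, 1, 1, 1, 0))
```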
Fetch batches of data from an API Endpoint with Parameters¶
fetch_table_streaming_endpoint in the notebook sends a call to the API with parameters for the batch size, start and end dates, and, optionally, filters on the ‘bot’ and ‘server_name’ columns. This example reads in batches of 5 minutes to create a small DataFrame, which should then be processed, with the results of the processing appended to the final DataFrame.
The endpoint ‘wiki_stream_example’ first selects data for the range of dates, then for the batch, and finally applies the filters on column values.
These parameters are passed in the call to the API endpoint to select only the data for the batch. A DataFrame is created from the text part of the response.
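A sketch of the batched endpoint calls, assuming the pipe is published as a CSV endpoint; the parameter names (`batch_start`, `batch_end`, `bot`, `server_name`) are illustrative and must match the ones defined in the pipe. Optional filters are only sent when a value is given.

```python
from datetime import datetime, timedelta
from io import StringIO

import pandas as pd
import requests

TOKEN = "<YOUR_READ_TOKEN>"  # assumption: a token that can read the endpoint
# Assumption: the pipe wiki_stream_example is published as a CSV endpoint
ENDPOINT = "https://api.tinybird.co/v0/pipes/wiki_stream_example.csv"

def batch_params(batch_start: datetime, batch_end: datetime,
                 bot=None, server_name=None) -> dict:
    """Build the endpoint parameters for one batch (names are illustrative)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    params = {"batch_start": batch_start.strftime(fmt),
              "batch_end": batch_end.strftime(fmt)}
    if bot is not None:
        params["bot"] = bot
    if server_name is not None:
        params["server_name"] = server_name
    return params

def fetch_batches(start: datetime, end: datetime, minutes: int = 5, **filters):
    """Generator: one endpoint call per batch, yielding a small DataFrame each time."""
    cursor, step = start, timedelta(minutes=minutes)
    while cursor < end:
        params = {"token": TOKEN,
                  **batch_params(cursor, min(cursor + step, end), **filters)}
        r = requests.get(ENDPOINT, params=params)
        r.raise_for_status()
        yield pd.read_csv(StringIO(r.text))  # process, keep only what you need
        cursor += step
```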
The full code for these examples is in this Colab notebook.