Total download count differs from Google BigQuery

A few months ago I released the OpenActive Python package on both conda-forge and PyPI:
https://anaconda.org/conda-forge/openactive
https://pypi.org/project/openactive/

I’m now checking on download counts for organisation monitoring and reporting needs. I understand that download counts aren’t necessarily always super accurate, but even so I’d still like to get an idea.

The OpenActive conda-forge page itself says 328 total downloads at the time of writing. PyPI doesn’t give download counts directly, but rather directs people to use Google BigQuery instead. I just set this up and ran the following SQL query:

SELECT
  details.installer.name AS `channel`,
  COUNT(*) AS num_downloads,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE
  file.project = 'openactive'
GROUP BY `channel`
ORDER BY `channel`

From which I get the following results:

Row  channel       num_downloads
1    null          158
2    Browser       214
3    Nexus         6
4    bandersnatch  872
5    conda         12
6    devpi         1
7    pip           111
8    requests      129
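
To put these rows in proportion, here is a small sketch that tallies the result table and separates mirror traffic from everything else. The figures are copied from the query output. Treating bandersnatch (a PyPI mirroring tool) as mirror traffic rather than real installs is my own assumption, and the variable names are purely illustrative.

```python
# Rows copied from the BigQuery result; None stands for the null installer.
downloads_by_installer = {
    None: 158,
    "Browser": 214,
    "Nexus": 6,
    "bandersnatch": 872,
    "conda": 12,
    "devpi": 1,
    "pip": 111,
    "requests": 129,
}

# Assumption: bandersnatch rows come from mirrors syncing PyPI,
# not from people installing the package.
MIRROR_INSTALLERS = {"bandersnatch"}

total = sum(downloads_by_installer.values())
mirror = sum(
    n for name, n in downloads_by_installer.items() if name in MIRROR_INSTALLERS
)
non_mirror = total - mirror

print(f"total rows:      {total}")
print(f"mirror rows:     {mirror}")
print(f"non-mirror rows: {non_mirror}")
print(f"pip share of non-mirror: {downloads_by_installer['pip'] / non_mirror:.1%}")
```

If the mirror assumption holds, more than half of the rows are not installs at all, which is worth keeping in mind before comparing any single channel against another source.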

So now I can see the pip downloads, which is what I was really after. But conda is also reported, at only 12 downloads compared to the 328 shown on the conda-forge page.

My question is - why is there a difference, and can I therefore really trust either source at all, for any channel?

Any help or insights much appreciated. Thanks a lot in advance for any response.

Is that conda downloading things from PyPI? Because in that case that’s only what’s added to a conda environment.yml file in the pip: section, not direct conda install ... calls.

You can find conda-forge download data in the ContinuumIO/anaconda-package-data repository on GitHub, or via condastats.

Hi Jaime,

Thanks for your response. I had a look at the anaconda-package-data and condastats packages, which unfortunately led into a rabbit hole of errors and issues (details for the former are below for interest). After some effort, I did manage to get anaconda-package-data running locally by pip installing intake, intake-parquet, python-snappy, requests, aiohttp, jinja2, dask, and s3fs. Either conda installing these packages or conda installing anaconda-package-data led to an error at some point during execution. I could run the following intake.open_catalog command in all cases, but the cat.anaconda_package_data_by_month command failed for dates before March 2024 with the conda installed methods.

So just focussing on the successful method, I find:

>>> import intake
>>> cat = intake.open_catalog('https://raw.githubusercontent.com/ContinuumIO/anaconda-package-data/master/catalog/anaconda_package_data.yaml')
>>> m0 = cat.anaconda_package_data_by_month(year=2023, month=12).to_dask()
>>> m1 = cat.anaconda_package_data_by_month(year=2024, month=1).to_dask()
>>> m2 = cat.anaconda_package_data_by_month(year=2024, month=2).to_dask()
>>> m3 = cat.anaconda_package_data_by_month(year=2024, month=3).to_dask()
>>> m4 = cat.anaconda_package_data_by_month(year=2024, month=4).to_dask()
>>> m5 = cat.anaconda_package_data_by_month(year=2024, month=5).to_dask()

First check the number of OpenActive entries each month:

>>> len(m0.loc[m0['pkg_name']=='openactive'])
0
>>> len(m1.loc[m1['pkg_name']=='openactive'])
0
>>> len(m2.loc[m2['pkg_name']=='openactive'])
0
>>> len(m3.loc[m3['pkg_name']=='openactive'])
40
>>> len(m4.loc[m4['pkg_name']=='openactive'])
91
>>> len(m5.loc[m5['pkg_name']=='openactive'])
138

I would have expected some non-zero entries from January onwards, as that’s when this package launched.

Now check the number of OpenActive download counts each month:

>>> sum(m0.loc[m0['pkg_name']=='openactive']['counts'])
0
>>> sum(m1.loc[m1['pkg_name']=='openactive']['counts'])
0
>>> sum(m2.loc[m2['pkg_name']=='openactive']['counts'])
0
>>> sum(m3.loc[m3['pkg_name']=='openactive']['counts'])
176
>>> sum(m4.loc[m4['pkg_name']=='openactive']['counts'])
291
>>> sum(m5.loc[m5['pkg_name']=='openactive']['counts'])
444

So the total to date by this reckoning is 176 + 291 + 444 = 911, which is now a third, different download figure for conda! I am more confused than before.
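
For anyone who wants to check the arithmetic, here is a minimal sketch that puts the three conda figures side by side. The monthly numbers are copied from the interpreter output above, and 328 and 12 are the figures quoted earlier in this thread. The dictionary layout and names are just my own illustration.

```python
# (number of rows, summed counts) per month, copied from the interpreter output.
monthly = {
    "2023-12": (0, 0),
    "2024-01": (0, 0),
    "2024-02": (0, 0),
    "2024-03": (40, 176),
    "2024-04": (91, 291),
    "2024-05": (138, 444),
}

package_data_total = sum(counts for _, counts in monthly.values())

# "YYYY-MM" strings sort chronologically, so min() gives the earliest month.
first_month_with_data = min(
    month for month, (entries, _) in monthly.items() if entries > 0
)

# The three conda figures seen so far.
figures = {
    "anaconda.org page": 328,
    "BigQuery (installer = conda)": 12,
    "anaconda-package-data": package_data_total,
}

for source, value in figures.items():
    print(f"{source:30s} {value:5d}")
print("first month with any rows:", first_month_with_data)
```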


Issues

Four methods are discussed here for setting up access to anaconda-package-data:

  • Method-1) conda install: intake, intake-parquet, python-snappy, requests, aiohttp, jinja2, dask, and s3fs
  • Method-2) pip install: intake, intake-parquet, python-snappy, requests, aiohttp, jinja2, dask, and s3fs
  • Method-3) conda install: anaconda-package-data
  • Method-4) pip install: anaconda-package-data … not actually possible as anaconda-package-data is not on PyPI

Issue-1
The anaconda-package-data binder notebook fails on its imports.
The anaconda-package-data binder dashboard doesn’t load properly at all.

Issue-2
The GitHub readme says to use intake.Catalog, but this results in an error message saying to use intake.open_catalog instead, so the readme needs updating. This useful error only appears for method-1 and method-3. For method-2 we get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/readers/entry.py", line 236, in __init__
    [self.add_entry(e) for e in entries]
  File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/readers/entry.py", line 236, in <listcomp>
    [self.add_entry(e) for e in entries]
  File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/readers/entry.py", line 250, in add_entry
    entry.kwargs = find_funcs(entry.kwargs, tokens)
AttributeError: 'str' object has no attribute 'kwargs'

Issue-3
After getting cat via intake.open_catalog, we run the following code as per the GitHub readme:

monthly = cat.anaconda_package_data_by_month(year=2019, month=12).to_dask()

With the date used in the readme, this only works for method-2; method-1 and method-3 give the error below. All methods do seem to work with recent dates, however.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/intake_parquet/source.py", line 107, in to_dask
    df = dd.read_parquet(self._urlpath, storage_options=self._storage_options,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask_expr/_collection.py", line 5433, in read_parquet
    ReadParquetFSSpec(
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask_expr/_core.py", line 57, in __new__
    _name = inst._name
            ^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/functools.py", line 995, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask_expr/io/parquet.py", line 776, in _name
    funcname(type(self)), self.checksum, *self.operands[:-1]
                          ^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask_expr/io/parquet.py", line 782, in checksum
    return self._dataset_info["checksum"]
           ^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask_expr/io/parquet.py", line 1375, in _dataset_info
    meta = self.engine._create_dd_meta(dataset_info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask/dataframe/io/parquet/arrow.py", line 1215, in _create_dd_meta
    meta = cls._arrow_table_to_pandas(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask/dataframe/io/parquet/arrow.py", line 1878, in _arrow_table_to_pandas
    res = arrow_table.to_pandas(categories=categories, **_kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 884, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 4192, in pyarrow.lib.Table._to_pandas
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/pyarrow/pandas_compat.py", line 776, in table_to_dataframe
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/pyarrow/pandas_compat.py", line 1131, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/pyarrow/pandas_compat.py", line 736, in _reconstruct_block
    pd_ext_arr = pandas_dtype.__from_arrow__(arr)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/pandas/core/arrays/string_.py", line 217, in __from_arrow__
    return ArrowStringArray(array)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/pandas/core/arrays/string_arrow.py", line 143, in __init__
    raise ValueError(
ValueError: ArrowStringArray requires a PyArrow (chunked) array of large_string type

Issue-4
The following code in the readme is only mentioned for method-3:

monthly = intake.cat.anaconda_package_data_by_month(year=2019, month=12).to_dask()

Trying it with method-1 and method-2 results in an error, as could be expected, but the two errors differ. The error for method-1 is:

Traceback (most recent call last):
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/intake/catalog/base.py", line 402, in __getattr__
    return self[item]  # triggers reload_on_change
           ~~~~^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/intake/catalog/base.py", line 475, in __getitem__
    raise KeyError(key)
KeyError: 'anaconda_package_data_by_month'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/intake/catalog/base.py", line 404, in __getattr__
    raise AttributeError(item) from e
AttributeError: anaconda_package_data_by_month

The error for method-2 is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/__init__.py", line 76, in __getattr__
    gl[attr] = import_name(dest)
  File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/utils.py", line 31, in import_name
    mod = getattr(mod, bit)
  File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/catalog/__init__.py", line 25, in __getattr__
    builtin = _make_builtin()
  File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/catalog/__init__.py", line 15, in _make_builtin
    [EntrypointsCatalog(), load_combo_catalog()],
  File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/catalog/local.py", line 933, in __init__
    super().__init__(*args, **kwargs)
  File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/catalog/base.py", line 128, in __init__
    self.force_reload()
  File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/catalog/base.py", line 186, in force_reload
    self._load()
  File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/catalog/local.py", line 944, in _load
    for name, entrypoint in catalogs.items():
AttributeError: 'tuple' object has no attribute 'items'

More importantly, we also get an error for method-3, which is the only method the readme presents this code for. If we change the year to 2018 or lower, we get the PyArrow error seen in issue-3. For any year above 2018, we get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/anaconda-package-data-2/lib/python3.12/site-packages/intake/source/base.py", line 399, in configure_new
    obj = self._entry(**kw)
          ^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data-2/lib/python3.12/site-packages/intake/catalog/entry.py", line 79, in __call__
    s = self.get(**kwargs)
        ^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data-2/lib/python3.12/site-packages/intake/catalog/local.py", line 272, in get
    plugin, open_args = self._create_open_args(user_parameters)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data-2/lib/python3.12/site-packages/intake/catalog/local.py", line 243, in _create_open_args
    open_args = merge_pars(params, user_parameters, self._user_parameters, getshell=self.getshell, getenv=self.getenv, client=False)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data-2/lib/python3.12/site-packages/intake/catalog/utils.py", line 232, in merge_pars
    context[par.name] = par.validate(val)
                        ^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/anaconda-package-data-2/lib/python3.12/site-packages/intake/catalog/local.py", line 118, in validate
    raise ValueError("%s=%s is greater than %s" % (self.name, value, self.max))
ValueError: year=2019 is greater than 2018
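
Reading the last frame of that traceback, the catalog that comes with the conda package appears to declare an upper bound of 2018 on the year parameter. That is only my reading of the error message, not something I have confirmed in the package itself. The sketch below mimics that kind of bound check so the behaviour is easy to see. BoundedParameter is a hypothetical stand-in written for illustration, not intake's actual class.

```python
class BoundedParameter:
    """Hypothetical stand-in for a catalog parameter with an upper bound."""

    def __init__(self, name, max_value):
        self.name = name
        self.max = max_value

    def validate(self, value):
        # Same check and message shape as the last frame of the traceback.
        if value > self.max:
            raise ValueError(
                "%s=%s is greater than %s" % (self.name, value, self.max)
            )
        return value


# The bound of 2018 is taken from the error message itself.
year = BoundedParameter("year", 2018)

for candidate in (2018, 2019):
    try:
        year.validate(candidate)
        print(candidate, "accepted")
    except ValueError as err:
        print(candidate, "rejected:", err)
```

If this reading is right, no year after 2018 can pass validation with method-3, whatever data is actually available upstream.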