Hi Jaime,
Thanks for your response. I had a look at the anaconda-package-data and condastats packages, which unfortunately led into a rabbit hole of errors and issues (details for the former are below for interest). After some effort, I did manage to get anaconda-package-data running locally by pip installing intake, intake-parquet, python-snappy, requests, aiohttp, jinja2, dask, and s3fs. Either conda installing these packages or conda installing anaconda-package-data led to an error at some point during execution. I could run the following intake.open_catalog command in all cases, but the cat.anaconda_package_data_by_month command failed for dates before March 2024 with the conda installed methods.
So just focussing on the successful method, I find:
>>> import intake
>>> cat = intake.open_catalog('https://raw.githubusercontent.com/ContinuumIO/anaconda-package-data/master/catalog/anaconda_package_data.yaml')
>>> m0 = cat.anaconda_package_data_by_month(year=2023, month=12).to_dask()
>>> m1 = cat.anaconda_package_data_by_month(year=2024, month=1).to_dask()
>>> m2 = cat.anaconda_package_data_by_month(year=2024, month=2).to_dask()
>>> m3 = cat.anaconda_package_data_by_month(year=2024, month=3).to_dask()
>>> m4 = cat.anaconda_package_data_by_month(year=2024, month=4).to_dask()
>>> m5 = cat.anaconda_package_data_by_month(year=2024, month=5).to_dask()
First check the number of OpenActive entries each month:
>>> len(m0.loc[m0['pkg_name']=='openactive'])
0
>>> len(m1.loc[m1['pkg_name']=='openactive'])
0
>>> len(m2.loc[m2['pkg_name']=='openactive'])
0
>>> len(m3.loc[m3['pkg_name']=='openactive'])
40
>>> len(m4.loc[m4['pkg_name']=='openactive'])
91
>>> len(m5.loc[m5['pkg_name']=='openactive'])
138
I would have expected some non-zero entries from January onwards, as that’s when this package launched.
Now check the number of OpenActive download counts each month:
>>> sum(m0.loc[m0['pkg_name']=='openactive']['counts'])
0
>>> sum(m1.loc[m1['pkg_name']=='openactive']['counts'])
0
>>> sum(m2.loc[m2['pkg_name']=='openactive']['counts'])
0
>>> sum(m3.loc[m3['pkg_name']=='openactive']['counts'])
176
>>> sum(m4.loc[m4['pkg_name']=='openactive']['counts'])
291
>>> sum(m5.loc[m5['pkg_name']=='openactive']['counts'])
444
So the total to date by this reckoning is 176 + 291 + 444 = 911, which is now a third, different download count for conda! I am more confused than before.
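The per-month checks above repeat the same two expressions, so they could be wrapped in a small helper. This is only a sketch: the name summarise_package is made up for illustration, and it assumes each monthly table has the pkg_name and counts columns used above. It accepts either a pandas or a dask dataframe.

```python
import pandas as pd


def summarise_package(frame, pkg_name):
    """Return (row count, total downloads) for one package in one month's table."""
    rows = frame.loc[frame["pkg_name"] == pkg_name]
    # Dask dataframes are lazy; materialise the (small) filtered result.
    if hasattr(rows, "compute"):
        rows = rows.compute()
    return len(rows), int(rows["counts"].sum())


# Made-up rows, only to show the call; these are not real download figures.
demo = pd.DataFrame(
    {"pkg_name": ["openactive", "numpy", "openactive"], "counts": [2, 10, 3]}
)
print(summarise_package(demo, "openactive"))  # (2, 5)
```

With the tables loaded earlier, calling summarise_package(m3, 'openactive') and so on for each month should give the same pairs of numbers as the individual commands.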
Issues
Four methods are discussed here for setting up an environment to read anaconda-package-data:
- Method-1) conda install: intake, intake-parquet, python-snappy, requests, aiohttp, jinja2, dask, and s3fs
- Method-2) pip install: intake, intake-parquet, python-snappy, requests, aiohttp, jinja2, dask, and s3fs
- Method-3) conda install: anaconda-package-data
- Method-4) pip install: anaconda-package-data … not actually possible, as anaconda-package-data is not available via pip
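Since the methods differ mainly in which package versions end up installed, it may help to record those versions in each environment when comparing results. A minimal sketch using only the standard library (Python 3.8+); the helper name report_versions is made up, and pandas and pyarrow are included in the list only because they appear in the tracebacks below:

```python
from importlib import metadata


def report_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


packages = [
    "intake", "intake-parquet", "python-snappy", "requests", "aiohttp",
    "jinja2", "dask", "s3fs", "pandas", "pyarrow",
]
for name, version in report_versions(packages).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this in each of the environments would give a side-by-side list of versions to attach to a bug report.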
Issue-1
The anaconda-package-data binder notebook fails on its imports.
The anaconda-package-data binder dashboard doesn’t load properly at all.
Issue-2
The GitHub readme says to use intake.Catalog, but this results in an error message saying to use intake.open_catalog instead, so the readme needs updating. This useful error only appears for method-1 and method-3; for method-2 we get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/readers/entry.py", line 236, in __init__
[self.add_entry(e) for e in entries]
File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/readers/entry.py", line 236, in <listcomp>
[self.add_entry(e) for e in entries]
File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/readers/entry.py", line 250, in add_entry
entry.kwargs = find_funcs(entry.kwargs, tokens)
AttributeError: 'str' object has no attribute 'kwargs'
Issue-3
After getting cat via intake.open_catalog, we run the following code as per the GitHub readme:
monthly = cat.anaconda_package_data_by_month(year=2019, month=12).to_dask()
For the date used in the readme, this only works for method-2, and results in the following error for method-1 and method-3. All methods do, however, seem to work with recent dates.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/intake_parquet/source.py", line 107, in to_dask
df = dd.read_parquet(self._urlpath, storage_options=self._storage_options,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask_expr/_collection.py", line 5433, in read_parquet
ReadParquetFSSpec(
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask_expr/_core.py", line 57, in __new__
_name = inst._name
^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/functools.py", line 995, in __get__
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask_expr/io/parquet.py", line 776, in _name
funcname(type(self)), self.checksum, *self.operands[:-1]
^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask_expr/io/parquet.py", line 782, in checksum
return self._dataset_info["checksum"]
^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask_expr/io/parquet.py", line 1375, in _dataset_info
meta = self.engine._create_dd_meta(dataset_info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask/dataframe/io/parquet/arrow.py", line 1215, in _create_dd_meta
meta = cls._arrow_table_to_pandas(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/dask/dataframe/io/parquet/arrow.py", line 1878, in _arrow_table_to_pandas
res = arrow_table.to_pandas(categories=categories, **_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 884, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 4192, in pyarrow.lib.Table._to_pandas
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/pyarrow/pandas_compat.py", line 776, in table_to_dataframe
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/pyarrow/pandas_compat.py", line 1131, in _table_to_blocks
return [_reconstruct_block(item, columns, extension_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/pyarrow/pandas_compat.py", line 736, in _reconstruct_block
pd_ext_arr = pandas_dtype.__from_arrow__(arr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/pandas/core/arrays/string_.py", line 217, in __from_arrow__
return ArrowStringArray(array)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/pandas/core/arrays/string_arrow.py", line 143, in __init__
raise ValueError(
ValueError: ArrowStringArray requires a PyArrow (chunked) array of large_string type
Issue-4
The following code in the readme is only mentioned for method-3:
monthly = intake.cat.anaconda_package_data_by_month(year=2019, month=12).to_dask()
Trying it with method-1 and method-2 results in errors, as could be expected, but the errors differ. The error for method-1 is:
Traceback (most recent call last):
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/intake/catalog/base.py", line 402, in __getattr__
return self[item] # triggers reload_on_change
~~~~^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/intake/catalog/base.py", line 475, in __getitem__
raise KeyError(key)
KeyError: 'anaconda_package_data_by_month'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/miniconda3/envs/anaconda-package-data/lib/python3.12/site-packages/intake/catalog/base.py", line 404, in __getattr__
raise AttributeError(item) from e
AttributeError: anaconda_package_data_by_month
The error for method-2 is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/__init__.py", line 76, in __getattr__
gl[attr] = import_name(dest)
File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/utils.py", line 31, in import_name
mod = getattr(mod, bit)
File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/catalog/__init__.py", line 25, in __getattr__
builtin = _make_builtin()
File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/catalog/__init__.py", line 15, in _make_builtin
[EntrypointsCatalog(), load_combo_catalog()],
File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/catalog/local.py", line 933, in __init__
super().__init__(*args, **kwargs)
File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/catalog/base.py", line 128, in __init__
self.force_reload()
File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/catalog/base.py", line 186, in force_reload
self._load()
File "/Users/darrentemple/Documents/OpenActive/openactive-python-stats/virt/lib/python3.8/site-packages/intake/catalog/local.py", line 944, in _load
for name, entrypoint in catalogs.items():
AttributeError: 'tuple' object has no attribute 'items'
More importantly, we also get an error for method-3, which is the only method this code is presented for in the readme. The error below occurs for any year above 2018; for 2018 and earlier, we instead get the PyArrow error seen in issue-3. The error for years above 2018 is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/miniconda3/envs/anaconda-package-data-2/lib/python3.12/site-packages/intake/source/base.py", line 399, in configure_new
obj = self._entry(**kw)
^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data-2/lib/python3.12/site-packages/intake/catalog/entry.py", line 79, in __call__
s = self.get(**kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data-2/lib/python3.12/site-packages/intake/catalog/local.py", line 272, in get
plugin, open_args = self._create_open_args(user_parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data-2/lib/python3.12/site-packages/intake/catalog/local.py", line 243, in _create_open_args
open_args = merge_pars(params, user_parameters, self._user_parameters, getshell=self.getshell, getenv=self.getenv, client=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data-2/lib/python3.12/site-packages/intake/catalog/utils.py", line 232, in merge_pars
context[par.name] = par.validate(val)
^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/anaconda-package-data-2/lib/python3.12/site-packages/intake/catalog/local.py", line 118, in validate
raise ValueError("%s=%s is greater than %s" % (self.name, value, self.max))
ValueError: year=2019 is greater than 2018
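The last line of this traceback suggests that the catalog installed with method-3 declares a maximum of 2018 for its year parameter, so any later year is rejected before any data is read. The sketch below is a simplified, hypothetical stand-in for that kind of bounds check (it is not intake's actual code; the class name BoundedParameter is made up, and the bound of 2018 is taken from the error message):

```python
class BoundedParameter:
    """Hypothetical, simplified stand-in for a catalog parameter with a maximum."""

    def __init__(self, name, maximum):
        self.name = name
        self.max = maximum

    def validate(self, value):
        # Reject values above the declared maximum, mirroring the message above.
        if value > self.max:
            raise ValueError(
                "%s=%s is greater than %s" % (self.name, value, self.max)
            )
        return value


year = BoundedParameter("year", 2018)  # assumed bound, read off the error message
print(year.validate(2018))  # 2018
try:
    year.validate(2019)
except ValueError as err:
    print(err)  # year=2019 is greater than 2018
```

If that reading is right, the fix would be on the packaging side: the catalog file bundled with the conda package would need its year bound raised to match the catalog on GitHub.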