2. Objective:
• Install the Apache Beam Python SDK in a Google Cloud Platform environment.
• Create a pipeline with PCollections, then apply Count to count
elements in three ways:
• Counting all elements in a PCollection
• Counting elements for each key
• Counting all unique elements
Counting all elements in a PCollection:
• Count.Globally() to count all elements in a PCollection, even if there
are duplicate elements.
Counting elements for each key:
• Count.PerKey() to count the elements for each unique key in
a PCollection of key-value pairs.
3. Counting all unique elements:
• Count.PerElement() to count how many times each unique element
occurs in a PCollection.
Resources:
• https://cloud.google.com/dataflow/docs/guides/installing-beam-sdk#python
• https://beam.apache.org/documentation/
• https://beam.apache.org/documentation/transforms/python/aggregation/count/
4. Error:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.7/tokenize.py", line 447, in open
    buffer = _builtin_open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-041su91e/orjson/setup.py'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-041su91e/orjson/
Solution:
A) System Install
• sudo python3 -m pip install -U pip
• sudo python3 -m pip install -U setuptools
B) Virtual Env / Pipenv
# Within the venv
• pip3 install -U pip
• pip3 install -U setuptools
Finally, try installing again with: pip3 install apache-beam