Issue
I work on a project that maintains a data lake that centralizes public information from the Brazilian government. Our pipelines run on a Kubernetes cluster.
I'm currently building a pipeline for labor market data. This is the bash script I use to download the data:
#!/bin/bash
# To run this script the user must run 'bash download.sh group', where group is cagedmov | cagedfor | cagedexc.
# See explanation in the next comment:
# The microdata from the new consolidation are published by month of
# disclosure, starting in January 2020, with three files for each
# competence (reference month). Following a consistent naming pattern,
# CAGEDMOVYYYYMM files contain the movements declared within the deadline
# with declaration competence equal to YYYYMM. CAGEDFORYYYYMM files contain
# the movements declared after the deadline with declaration competence
# equal to YYYYMM. CAGEDEXCYYYYMM files contain the excluded movements with
# exclusion competence equal to YYYYMM.
lower_group=$1
upper_group=${lower_group^^}
mkdir -p /tmp/novo_caged/$lower_group/input
ufs=('RO' 'AC' 'AM' 'RR' 'PA' 'AP' 'TO' 'MA' 'PI' 'CE' 'RN' 'PB' 'PE' 'AL' 'SE' 'BA' 'MG' 'ES' 'RJ' 'SP' 'PR' 'SC' 'RS' 'MS' 'MT' 'GO' 'DF')
anos=(2020 2021 2022)
meses=($(seq 1 1 12))
for uf in "${ufs[@]}"
do
    for ano in "${anos[@]}"
    do
        for mes in "${meses[@]}"
        do
            mkdir -p "/tmp/novo_caged/$lower_group/ano=$ano/mes=$mes/sigla_uf=$uf/"
        done
    done
done
# Abort if the input directory is missing, so the later rm never runs elsewhere
cd "/tmp/novo_caged/$lower_group/input" || exit 1
ftp_path="ftp://anonymous:anonymous@ftp.mtps.gov.br/pdet/microdados/NOVO CAGED/"
pad_meses=($(echo {01..12}))
folders=($(seq 202001 1 202012))
for ano in "${anos[@]}"
do
    for mes in "${pad_meses[@]}"
    do
        wget "$ftp_path$ano/$ano$mes/$upper_group$ano$mes.7z"
        7z x -y "$upper_group$ano$mes.7z"
        rm *7z
    done
done
The script runs perfectly on my computer, but when I deploy it to the Kubernetes cluster, it fails with the error "Failed to connect to ftp.mtps.gov.br port 21: Connection timed out". Apparently, ftp.mtps.gov.br only accepts requests from Brazilian IP addresses. Is there a way to get around this restriction? Automating this ETL so that the data is published up to date is very important for our project.
Solution
You can use Tor as a SOCKS5 proxy and configure it so that traffic exits from a specific country.
In the torrc configuration file, add these lines, or modify the existing ones if they are already present:
ExitNodes {br}
StrictNodes 1
Finally, you need to tell your bash script to use Tor.
This can be done in several ways; the easiest is to use the torify command.
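As a rough sketch of the torify approach, the wrapper below runs any command through torify and fails early if torify is not installed. The function name run_via_tor is hypothetical, and the sketch assumes a Tor daemon is already running locally with the torrc lines shown earlier.

```shell
# Hypothetical helper: run a command through Tor using torify.
# Assumes the tor daemon is running locally and torify is on PATH.
run_via_tor() {
    if ! command -v torify >/dev/null 2>&1; then
        echo "torify not found; install tor/torsocks first" >&2
        return 127
    fi
    # Pass the command and all its arguments to torify unchanged
    torify "$@"
}

# Example (not executed here): run_via_tor bash download.sh cagedmov
```

Without the wrapper, the simplest form is to call torify bash download.sh cagedmov directly; the wget and 7z lines inside the script stay unchanged.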
I suggest testing everything first by adding this line at the top of the script:
#!/bin/bash
curl https://api.myip.com;exit
This shows which country is being used as the Tor exit node. If it is correct, remove the test line.
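The manual check can also be automated so that the script aborts when the exit node is not in Brazil. The sketch below assumes api.myip.com returns JSON with a two-letter "cc" field, for example {"ip":"...","country":"Brazil","cc":"BR"}; that response shape is an assumption and should be verified against the live service. The function names exit_country_code and is_brazil_exit are hypothetical.

```shell
# Hypothetical helper: extract the two-letter country code from a JSON
# response assumed to look like {"ip":"...","country":"...","cc":"BR"}.
# Prints nothing if no "cc" field is found.
exit_country_code() {
    printf '%s' "$1" | sed -n 's/.*"cc"[[:space:]]*:[[:space:]]*"\([A-Za-z][A-Za-z]\)".*/\1/p'
}

# Returns 0 only when the response reports a Brazilian exit node.
is_brazil_exit() {
    [ "$(exit_country_code "$1")" = "BR" ]
}

# Example (not executed here; requires Tor and network access):
#   response=$(torify curl -s https://api.myip.com)
#   is_brazil_exit "$response" || { echo "exit node is not in Brazil" >&2; exit 1; }
```

Placing a check like this before the download loop would let the pipeline fail fast instead of waiting for each FTP connection to time out.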
https://www.torproject.org/
https://linux.die.net/man/1/torify
Answered By - franzisk Answer Checked By - Robin (PHPFixing Admin)