Some HTML files result in UnicodeDecodeError when read by BeautifulSoup #60

Closed
denisalevi opened this issue Jun 22, 2016 · 7 comments

@denisalevi
Contributor

I couldn't find a solution, but the error seems to appear for only 3 files (out of all files downloaded so far), all of them from the same download time. So unless somebody has already encountered and fixed this problem, I suggest just removing the following files from the server:

accuweather_01-06-2016_17\:07_frankfurt_daily_d1_1464793657.html
accuweather_01-06-2016_17\:07_frankfurt_daily_d4_1464793657.html
accuweather_01-06-2016_17\:07_frankfurt_daily_d5_1464793657.html
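
As an alternative to deleting the files, one possible workaround is to read each file as bytes and decode it leniently, so that a few bad bytes do not abort the whole parse. This is only a minimal sketch: the helper name `read_html_lenient` is hypothetical, and it assumes the files are mostly UTF-8 with a handful of invalid byte sequences, which has not been confirmed for these files.

```python
from pathlib import Path


def read_html_lenient(path, encoding="utf-8"):
    """Return the file's text, replacing undecodable bytes with U+FFFD.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    raw = Path(path).read_bytes()  # reading bytes never raises a decode error
    # errors="replace" substitutes U+FFFD for invalid sequences instead of raising
    return raw.decode(encoding, errors="replace")
```

The returned string could then be passed on, e.g. `BeautifulSoup(text, "html.parser")`. Replaced characters are lost, so this only makes sense if the damaged bytes are not in the fields being scraped.
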
@erensezener
Contributor

But did I give you all of the accuweather data, or just a sample?

@erensezener
Contributor

Yes, there are other files with the same problem. Can you handle this in your code?

@denisalevi
Contributor Author

I think you gave me all of the accuweather data last week. I will try to handle it in my script later today.

@denisalevi
Contributor Author

Are the files that give you the error new ones from last week?

@erensezener
Contributor

No, for instance accuweather_02-06-2016_17:07_bielefeld_daily_d5_1464880053.html

@denisalevi
Contributor Author

Why should I handle this in my script? Don't you catch exceptions anyway, so that when an error occurs the file just isn't saved to the database? There is nothing more I can do anyway.

And since the scraped data still has some files which will raise AssertionErrors (from my checks), e.g. the files from April which are for French cities instead of German ones (we didn't delete those), we will have to catch assertion errors anyway. I will write a log file where all caught errors are logged, so we can see what happens.
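
Roughly, the catch-and-log idea could look like the sketch below. It is not the actual script: `process_files`, the `parse_file` callback, and the log file name `parse_errors.log` are all hypothetical names chosen for illustration.

```python
import logging


def process_files(paths, parse_file, log_path="parse_errors.log"):
    """Parse each path; log and skip files that raise the expected errors.

    `parse_file` is a hypothetical callback standing in for the real parser.
    """
    logger = logging.getLogger("scraper_errors")
    logger.setLevel(logging.ERROR)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    logger.addHandler(handler)
    failed = []
    try:
        for path in paths:
            try:
                parse_file(path)
            except (UnicodeDecodeError, AssertionError) as exc:
                # Record which file failed and why, then move on to the next one
                logger.error("%s: %s: %s", path, type(exc).__name__, exc)
                failed.append(path)
    finally:
        # Detach and close the handler so the log file is flushed
        logger.removeHandler(handler)
        handler.close()
    return failed
```

Only the two expected exception types are caught here, so any other error still surfaces instead of being silently swallowed.
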

@denisalevi
Contributor Author

Should be handled in your script as shown in #78
