by xhedley on 1/31/21, 1:50 PM
First question posed to data set: when is the best time to post to Hacker News to maximise expected page rank?
Given you posted at your calculated time of 11:00 UTC on a Sunday, it will be interesting to see how it does. It’s at 13 right now.
by mpeteuil on 1/31/21, 6:11 PM
Nice post! For your "How popular is investing in the community?" section, I would try to adjust that so it's not an absolute number. Right now you don't know whether "investing" seems more popular just because there are more users or stories overall or because a higher proportion of the items submitted are about investing. For example, instead of the raw number of stories that month with "investing" in them, look at something like that number divided by the total number of stories for that month, which would give you the fraction of stories that month that contained the term "investing".
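A minimal sketch of that normalisation, assuming the stories are available as (month, title) pairs. The sample rows and the `monthly_share` helper are made up for illustration and are not from the post:

```python
from collections import Counter


def monthly_share(stories, term):
    """Return {month: fraction of that month's stories whose title contains term}."""
    totals = Counter()
    matches = Counter()
    for month, title in stories:
        totals[month] += 1
        if term.lower() in title.lower():
            matches[month] += 1
    return {month: matches[month] / totals[month] for month in totals}


# Hypothetical sample data: the raw count doubles from December to January,
# but so does the total number of stories.
sample = [
    ("2020-12", "Investing in index funds"),
    ("2020-12", "Show HN: My side project"),
    ("2021-01", "Why investing apps boomed"),
    ("2021-01", "Investing basics"),
    ("2021-01", "Rust 1.49 released"),
    ("2021-01", "Ask HN: Best keyboards?"),
]

shares = monthly_share(sample, "investing")
print(shares)
```

In this made-up sample the raw count goes from 1 to 2, yet the share stays at one half in both months, which is the distinction the absolute number hides.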
by indrayam on 1/31/21, 3:29 PM
As someone who never thought of using Snowflake for personal use, what was the cost of setting up and using Snowflake? Did you use Snowflake in AWS? FWIW: I am super new to the Snowflake ecosystem
by alex_young on 1/31/21, 1:27 PM
“3,850 requests per second” to hacker news? Sounds like a denial of service attack ;)
by leetrout on 1/31/21, 1:51 PM
Question to the author and anyone else here:
Do you find the advice to reboot the VM after making a security change useful?
It will certainly work, but from my understanding it should be possible to change that setting so that it applies to the user's login, such that you only need to log out and back in. Maybe I am mistaken?
I try very hard to avoid reboots - it may be distracting for a focused article like this - so curious what others think.
by asah on 1/31/21, 4:36 PM
nice.
FYI that stock regexp is buggy and e.g. will match $42.36 which obviously isn't a symbol.
Indeed, symbols are tough enough to nail correctly e.g. people not including the $, symbols with periods, etc.
Offhand, I'd build a little neural net classifier (e.g. https://fasttext.cc/ ) and train this on a slew of example-posts that are/aren't about stonks. To get training data, use regexps and then run through them by hand (20+ per minute = 1200+ per hour, or outsource to amazon mturk at $0.25 per 10, incl verification = $30 per 1200). Also, there are probably easy ones you can classify 100% correctly with regexps, to increase the training set size.
I'm happy to help if you like.
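To illustrate the $42.36 false positive mentioned above, here is a rough sketch of a stricter pattern. The post's actual regexp is not quoted in this thread, so `TICKER_RE` and `find_tickers` are hypothetical, and the pattern assumes symbols are a `$` followed by 1 to 5 uppercase letters with an optional one- or two-letter class suffix after a period. It still misses symbols written without the `$`, which is the harder problem:

```python
import re

# Assumed ticker shape: "$" + 1-5 uppercase letters + optional ".X" / ".XY" suffix.
# Requiring a letter right after "$" is what rejects prices such as $42.36.
TICKER_RE = re.compile(r"\$[A-Z]{1,5}(?:\.[A-Z]{1,2})?\b")


def find_tickers(text):
    """Return every substring of text that looks like a $-prefixed ticker."""
    return TICKER_RE.findall(text)


print(find_tickers("Bought $GME at $42.36 and some $BRK.B"))
```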
by klabetron on 2/1/21, 8:41 AM
Regarding combining to a single file for loading: not necessary. You can have as many files in your bucket as you like. Just make them JSON Lines (one object per line, not an array). The COPY command will even skip files it has already loaded. (I use this for a couple automations every day.)
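For anyone unsure what the JSON Lines shape looks like, a small sketch follows. The `to_json_lines` helper and the sample records are made up for illustration; the upload to the bucket and the COPY step are not shown:

```python
import json


def to_json_lines(records):
    """Serialise an iterable of objects as JSON Lines: one object per line, no enclosing array."""
    return "".join(json.dumps(record) + "\n" for record in records)


# Hypothetical items, as they might come back from an API as a JSON array.
items = [{"id": 1, "type": "story"}, {"id": 2, "type": "comment"}]
print(to_json_lines(items), end="")
```

Each batch can then be written to its own file rather than being merged into one large file first.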
by fuy on 1/31/21, 10:22 PM
Interesting read, thanks!
Regarding generating missing ids (quote from the blogpost: "If you know how one would generate the missing ids between the gaps (which could be of variable size)"):
I don't have access to a Snowflake instance, but the following works in Postgres (if I understood the problem correctly):
with lead as (
    select id, lead(id, 1) over (order by id) as lead_by_one
    from gaps_table
),
gaps as (
    select id + 1 as start_gap,
           lead_by_one - 1 as end_gap
    from lead
    where lead_by_one - id > 1
)
select generate_series(start_gap, end_gap) as missing_ids
from gaps;
So if Snowflake has a generating function similar to generate_series, it should do the trick.
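The same logic can be sanity-checked outside the database. This is a sketch in plain Python that mirrors the query above (pair each id with its successor, then fill any gap larger than one); `missing_ids` is a hypothetical helper, not something from the post:

```python
def missing_ids(ids):
    """Return the ids absent between consecutive known ids, in ascending order."""
    ordered = sorted(set(ids))
    missing = []
    # zip(ordered, ordered[1:]) plays the role of lead(id, 1) over (order by id).
    for current, following in zip(ordered, ordered[1:]):
        if following - current > 1:
            # Equivalent of generate_series(start_gap, end_gap).
            missing.extend(range(current + 1, following))
    return missing


print(missing_ids([1, 2, 5, 6, 9]))
```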
by wallawaz on 1/31/21, 3:55 PM
Thanks for sharing!
I've recently started using Snowflake - it definitely has some nice features (cloning, storage integrations, etc). I'm interested in what tooling you used to generate your visualizations. Were the charts generated directly from query output or did you need to load your query results into something like seaborn?
by dm13450 on 1/31/21, 8:13 PM
Minor point, but when calculating the avg_diff_price for $GME you should be calculating a return (close-open)/open otherwise days where the stock went from 100 to 105 (5% increase) look the same as days when it went from 5 to 10 (100% increase).
Likewise, when calculating the correlations, that should be done on returns and not prices.
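A rough sketch of both points in plain Python. The function names are made up for illustration, and the correlation is an ordinary Pearson coefficient that you would apply to two return series rather than two price series:

```python
from math import sqrt


def daily_return(open_price, close_price):
    """Relative move over the day: (close - open) / open."""
    return (close_price - open_price) / open_price


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (intended for returns, not prices)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Both days have the same absolute difference of 5, but very different returns.
print(daily_return(100, 105))
print(daily_return(5, 10))
```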
by dvfjsdhgfv on 1/31/21, 8:03 PM
> you will need to add the following line to limits.conf and then reboot the machine
Is this normal in AWS? On most systems I worked with it's not true, you just need to start a new session - a reboot is not necessary.
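One way to check is to open a fresh login session after editing the file and print the limits the new process actually sees. The sketch below assumes the limits.conf line in question sets the open-file limit (nofile), which this thread does not state, and it only runs on Unix-like systems:

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    # Assumption: the limits.conf entry is a "nofile" entry; swap in another
    # RLIMIT_* constant if the setting concerns a different resource.
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print(f"open files: soft={soft} hard={hard}")
```

If the values printed in a new session already reflect the edit, the reboot was not needed on that system.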
by rsync on 1/31/21, 6:42 PM
Question for mods/dang/etc.: How interesting was it that a third party hit HN @ 300 requests per second for a total of 25M hits?
by bthomas on 1/31/21, 5:25 PM
Off topic: what do you use for CMS? I like how the code blocks and graphs are integrated. Great post!
by enz on 1/31/21, 9:13 PM
Great use-cases for SQL window functions!