Parsing Reddit comments – Python Reddit API Wrapper (PRAW) tutorial p.2


00:00:01 What's going on, everybody? Welcome to

00:00:03 part two of the Python Reddit API

00:00:05 Wrapper, or PRAW, tutorial mini-series. In

00:00:08 this tutorial, what we're going to be talking about

00:00:10 is at least beginning to parse comments.

00:00:13 So, like I said at the end of the

00:00:15 last video, comments represent a

00:00:16 different kind of challenge for a

00:00:18 variety of reasons. Mainly, it's just the

00:00:20 fact that comments aren't, you know,

00:00:22 perfectly in order. It's a tree of

00:00:25 data; it's not a linear form of data. So

00:00:29 anyways, I'm going to go ahead and remove

00:00:32 the subreddit dot subscribe, but the rest

00:00:34 of this stuff can remain. So, just

00:00:37 underneath this, let's go ahead and

00:00:38 continue. So the first thing we could do

00:00:41 is, first of all, I want to limit this...

00:00:43 there are two stickies, so

00:00:47 I'm just going to limit this to three,

00:00:48 just so we don't go, you know, so we just

00:00:51 do one submission for now. And now I'm

00:00:54 going to come down here, and we can

00:00:58 reference the comments by just saying

00:01:00 comments equals submission dot

00:01:03 comments. So this gives us the comments.

00:01:06 So now what we can do is say: for comment

00:01:09 in comments, we can go ahead, let's go

00:01:13 ahead, let's, like, print 20 times this, just

00:01:18 to give us some separation. And then

00:01:20 what we're going to do is we're going to

00:01:21 print comment. But just like a submission,

00:01:24 the comments are, like, these objects, like

00:01:27 the PRAW object, and the object is just

00:01:31 going to have the ID. So then you

00:01:33 reference an attribute, and one of the

00:01:34 attributes is body, for the body of that

00:01:37 comment. And then what we're going to say

00:01:40 is... so that's our comment. So

00:01:42 we can at least iterate through comments

00:01:44 that way. So, for example, let's just run

00:01:45 that real quick. [unclear]

00:01:51 So these are, like, all our, you know,

00:01:55 comments. Now let me pull up that... what?

00:02:01 Did I just close out of it? I guess I

00:02:04 closed out of it.

00:02:06[Music]
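
The top-level loop described up to this point might look roughly like the sketch below. The function names `top_level_bodies` and `print_top_level` are made up for illustration, and `submission` is assumed to be a PRAW submission object obtained as in part one. The sketch also assumes a small thread with no "load more comments" placeholders among the top-level comments.

```python
def top_level_bodies(submission):
    """Collect the text of each top-level comment on a submission."""
    # Iterating submission.comments walks the top-level comments only.
    # Each item is an object, so the text lives on its .body attribute.
    return [comment.body for comment in submission.comments]


def print_top_level(submission):
    """Print each top-level comment under a 20-character separator."""
    for body in top_level_bodies(submission):
        print(20 * "-")  # the separator character is an arbitrary choice
        print(body)
```

Replies to those comments are not visited by this loop, which is why some comments visible on the page do not show up in the output.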

00:02:07 Pull over mine. So that was why. So

00:02:12 there's six comments total here, but some

00:02:15 of these are, like, replies. Like, for

00:02:17 example, "If you're unfamiliar, do yourself

00:02:19 a favor and look into pandas." So, for

00:02:22 example, if you... let me look for this.

00:02:26 Um, okay. Okay, anyway, it's not here. Okay,

00:02:31 so what we have to do is iterate through

00:02:33 it. At least, I'm pretty sure it's not

00:02:34 there. So these would be just, like, top

00:02:37 levels, I'm pretty sure. I just want to be

00:02:45 a hundred percent. Sorry for wasting your

00:02:47 time. Anyway, so I think I closed it again,

00:02:51 'cause I'm bad at closing things. Anyway, I'm

00:02:54 pretty sure it's not there. So what we

00:02:55 need to do is get the replies. So now we

00:02:58 could say, you know, for reply... so for, or

00:03:06 rather, probably what we should do is...

00:03:08 there might not be any replies. So then

00:03:11 we could say: if len of comment dot replies

00:03:16 is greater than zero. And again, if you

00:03:18 didn't know replies existed, you could

00:03:20 have done a dir on comment, or

00:03:22 you can read the docs. Anyway,

00:03:25 if len of comment dot replies is greater

00:03:27 than zero, so we have some replies, then

00:03:28 what we do is: for reply in comment dot

00:03:31 replies. [unclear]

00:03:35 Anyway,

00:03:38 we can print, then, let's just say, like,

00:03:41 "REPLY:", and also we want body

00:03:48 on that.
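
A sketch of that one-level-deep loop is below. `walk_one_level` and `print_one_level` are hypothetical helper names, `submission` is again assumed to be a PRAW submission, and the "REPLY: " label is just one way to tell replies apart from top-level comments.

```python
def walk_one_level(submission):
    """Yield (depth, body) for top-level comments and their direct replies."""
    for comment in submission.comments:
        yield 0, comment.body
        # A comment may have no replies, so check before looping.
        if len(comment.replies) > 0:
            for reply in comment.replies:
                yield 1, reply.body


def print_one_level(submission):
    """Print top-level comments, marking direct replies with a label."""
    for depth, body in walk_one_level(submission):
        label = "REPLY: " if depth else ""
        print(label + body)
```

This only reaches direct replies. Anything nested deeper would need yet another inner loop per level.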

00:03:55 Okay, so here you get a... it's just me,

00:03:58 "reply." Really great, high-quality reply.

00:04:02 Yeah, okay. So, oh, and here's another reply.

00:04:08 I was like, there really isn't another one

00:04:09 yet? So this is that comment I

00:04:11 just searched for a second ago. So there

00:04:13 we caught that reply about pandas. But

00:04:16 then... I think I closed this. Let me open it

00:04:20 again.

00:04:20 Someone complained on one of my videos,

00:04:22 like, I just murder my Enter key. It's

00:04:25 true. Uh, okay, there you go. So,

00:04:31 "look into pandas," but then

00:04:32 there's another comment underneath that,

00:04:34 right? So then we would have to be like,

00:04:36 um, you know, we'd have to just

00:04:40 basically, okay, and then at this point, for

00:04:41 reply, we could say, okay, if len of reply

00:04:44 dot replies is greater than zero. But

00:04:46 you have no idea how deep down the

00:04:48 rabbit hole the comment tree things go,

00:04:50 right? So that's slightly

00:04:52 problematic, then. So the solution is, we

00:04:57 can actually say submission dot comments, and we

00:05:03 can add dot list to this, and this will

00:05:06 list out all of the comments. So dot

00:05:10 list, I believe, is purely a Python Reddit

00:05:14 API Wrapper, so purely a PRAW functionality;

00:05:17 that's not something that's

00:05:18 actually available to you in the Python...

00:05:20 all right, it's not something that's

00:05:22 actually available to you in the

00:05:23 Reddit API. But anyways, that doesn't

00:05:25 matter.

00:05:26 Let me go ahead and close this so we've

00:05:27 got a nice clean thing. And then also, we,

00:05:31 uh, we kind of want to do, like, print

00:05:34 comment dot body. We don't really want to do

00:05:36 the replies. So let's just do that. Let me

00:05:41 cancel this real quick.
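
As a rough sketch of the flattened version: `submission.comments.list()` is the PRAW call being described, while `all_bodies` is a name invented here. It assumes the thread is small enough that the flattened list contains only real comments.

```python
def all_bodies(submission):
    """Return the text of every comment in the thread, at any depth."""
    # .list() flattens the whole comment tree, replies included,
    # into one plain Python list, so a single loop reaches everything.
    return [comment.body for comment in submission.comments.list()]
```

Because the result is an ordinary list, no nested loops or depth checks are needed.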

00:05:46 So in this case, we've run through all of

00:05:50 them. So here you go, here's the

00:05:53 second-level reply. Now, unfortunately, we

00:05:56 have absolutely no idea of the

00:05:57 contextual data for this. Like, we don't

00:05:59 really know where this was in the

00:06:02 whole thing. So, for example, you know, you

00:06:05 wouldn't really know that this was in

00:06:07 reply to, you know, which reply it was to.

00:06:10 Now,

00:06:11 what list does is, basically, it takes all

00:06:13 the top-level comments, lists those out,

00:06:16 then it goes down to the second-level

00:06:18 comments, lists all those out, then third

00:06:20 level, and so on. So one option you have

00:06:22 is, rather than comment dot body, what you

00:06:25 could say is, you can also grab, like, you

00:06:28 can grab and print the parent ID,

00:06:34 and that would be comment dot parent. Now,

00:06:39 do note that's not an attribute;

00:06:40 that's an actual new API call, which, in

00:06:44 my opinion, is super unfortunate. I wish

00:06:46 that was supplied, and I don't think

00:06:48 that's a mistake. I believe that's

00:06:52 just in Reddit. And I realize not every

00:06:54 comment is going to necessarily have a

00:06:55 parent, but pretty much every comment

00:06:57 would, right? Like, you know, the parent is

00:06:59 the actual submission, or the parent is

00:07:01 another comment. So, and these are, like,

00:07:04 little tiny ID strings. Like, I really

00:07:07 think that should be included, but it's

00:07:08 not; it's a new API call.

00:07:10 So anyway, comment ID. So, comment dot

00:07:14 parent, and rather than that, this one is

00:07:15 just comment dot id, which is actually

00:07:18 an attribute. So, huh, crazy. I can't

00:07:23 remember if a submission... I'm pretty sure,

00:07:25 like, the submission contains the

00:07:27 subreddit ID. So, I could be wrong,

00:07:31 though. Anyway, that's okay. So now what we

00:07:34 could do is get the parent ID and the

00:07:35 comment ID of every comment, and then

00:07:39 what we could do is print the comment

00:07:40 body.
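
One way to collect the parent ID, comment ID, and body together is sketched below. `comment_rows` and `print_rows` are hypothetical names, and the sketch assumes every item in the flattened list is a real comment. It also relies on PRAW objects printing as their ID, which is why the parent is passed through `str()`. Separately from what the video uses, PRAW comments also carry a `parent_id` attribute holding the parent's prefixed ID (`t1_` for a comment, `t3_` for the submission), which may be worth a look.

```python
def comment_rows(submission):
    """Return (parent_id, comment_id, body) for every comment in the thread."""
    rows = []
    for comment in submission.comments.list():
        parent = comment.parent()  # a method call in PRAW, not a plain attribute
        # The parent is either another comment or the submission itself.
        rows.append((str(parent), comment.id, comment.body))
    return rows


def print_rows(submission):
    """Print each comment with its parent ID and its own ID."""
    for parent_id, comment_id, body in comment_rows(submission):
        print(20 * "-")
        print("Parent ID:", parent_id)
        print("Comment ID:", comment_id)
        print(body)
```

With the parent ID next to each comment ID, a flat list carries enough information to work out which comment replied to which.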

00:07:45 And then you've got the parent ID and the

00:07:46 comment ID of everything. Now, from

00:07:50 that point you could begin to do some

00:07:53 pretty cool stuff. But the first thing I

00:07:55 want to show you is, right, let's say,

00:07:57 let's say we don't do Python and instead

00:07:59 we do news, so a very, very popular

00:08:01 subreddit. And if this doesn't work, I'll

00:08:03 do, like, politics or something. But we

00:08:06 should hit an error here. Let's go... there

00:08:13 we go, here we go, there's the error. So if you

00:08:16 use the dot list and you actually do

00:08:18 iterate through all comments, chances are,

00:08:20 eventually, you're going to wind up with

00:08:22 this stupid error: 'MoreComments'

00:08:25 object has no attribute 'parent'. Okay, so

00:08:28 what's happening there is, like, on really

00:08:31 long comment chains... so, like, for example,

00:08:34 let me go to the news subreddit. That

00:08:40 would be this one: "Marijuana company buys

00:08:42 entire US town to create cannabis-friendly

00:08:44 municipality." That's going to have

00:08:47 lots of comments. So, for example, right

00:08:49 away you can see here this, like, "load

00:08:53 more comments." That's a MoreComments

00:08:56 object. And actually, even though Reddit

00:08:59 looks super simple, they're not going to load that

00:09:00 until you click this. I'm pretty sure

00:09:01 you're making a new call; like, it's an

00:09:04 actual call to their database. Same thing

00:09:07 with, like, "continue this thread": that's a

00:09:08 new call; it's going to reload that data.

00:09:10 Like, all this data is not loaded on your

00:09:12 page load. That would be nuts; you'd never

00:09:14 load the page. So anyways, if you wanted

00:09:17 to continue iterating through those

00:09:19 comments, you would need to also either

00:09:21 handle it with, you know, an exception or

00:09:23 something like that. Or, one option you

00:09:25 have is to replace the "more"s. So, for

00:09:29 example, coming down here, comments dot

00:09:31 list... one option you have is, so you could,

00:09:41 you can just use dot replace more. We're kind

00:09:43 of starting to add a little too many, um,

00:09:45 a little too many things here, but let's

00:09:50 just do...

00:09:52 I'll add the dot list down here,

00:09:54 and then what we'll say is dot replace

00:09:59 underscore more, and then for now we'll

00:10:02 say limit equals zero. But at some point

00:10:04 you will run into limits with the

00:10:06 replace more. Like, there's only so many

00:10:07 "more"s it will add. I think it's 30 or

00:10:09 something like that, which is still a ton of

00:10:12 comments, because, like, each replace more

00:10:14 will load in a bunch of comments. But

00:10:16 just keep that in mind, like,

00:10:17 you're going to run out eventually.

00:10:20 But it won't error if you do run out of

00:10:21 the option to continue replacing; instead,

00:10:24 it's just going to toss them, so you

00:10:25 won't hit an actual error anymore. So

00:10:27 anyways, let's go ahead and run

00:10:28 this real quick. And probably I should

00:10:30 remove the parent call; that's going to

00:10:32 slow me down.
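
For the other option mentioned, handling the "load more comments" placeholders yourself rather than replacing them, a minimal sketch could simply skip anything that has no body. `bodies_without_placeholders` is a made-up name, and the `hasattr` check is a duck-typing shortcut chosen here so the sketch needs no imports. An explicit `isinstance` check against PRAW's `MoreComments` class (from `praw.models`) would be the more direct way to say the same thing.

```python
def bodies_without_placeholders(submission):
    """Return comment text for the thread, skipping 'load more' placeholders."""
    bodies = []
    for comment in submission.comments.list():
        # Placeholder objects stand in for comments that have not been
        # loaded, so they have no .body (or .parent) to read.
        if not hasattr(comment, "body"):
            continue
        bodies.append(comment.body)
    return bodies
```

The trade-off is that the comments hidden behind each placeholder are silently left out of the result.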

00:10:35 What? Hmm, let's see. Submission dot

00:10:42 comments, okay, replace more. Hmm.

00:10:46 Okay, fine, fine, fine. One... okay, dot

00:10:48 list, and then we'll come over here:

00:10:50 comments dot replace more. Okay, so first

00:10:53 we've converted it to list form, which

00:10:55 then creates this MoreComments object,

00:10:58 and now we can replace them. I just did

00:11:00 it backwards. This should work. That's

00:11:03 still going to be a lot of queries to

00:11:04 the API, but hopefully we'll get through

00:11:06 it. Are you kidding me? Please. What have I

00:11:14 done? What have I done?

00:11:17 Comments dot replace more. So comments

00:11:19 equals submission dot comments.

00:11:25 I think I had it right the first time. So

00:11:32 comments equals submission dot comments.

00:11:39 Please.

00:11:41 So, where is... submission dot comments dot

00:11:47 replace more, limit equals 0. Now, for

00:11:53 comment in comments... let's see, no, for

00:11:58 comment in submission dot comments... I really

00:12:04 feel like I should have been able to

00:12:05 string that. Someone can comment below

00:12:07 what the fix should have been, because I

00:12:09 don't see why I wasn't able to string

00:12:11 those together, but obviously I'm messing up

00:12:12 something. So, for comment in submission dot

00:12:15 comments dot list. Let me try that. Drink

00:12:21 some more coffee. [unclear] There we go.

00:12:27 Not a problem. That's going forever,

00:12:30 though. I'm just going to

00:12:31 break that. Tons of

00:12:34 API calls; eventually it would probably

00:12:36 throttle me. Anyway, as you can see, now

00:12:39 we've got all the parent IDs, the comment

00:12:40 IDs, everything's hunky-dory, we're doing

00:12:42 great. So go ahead and close this out.
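
The order that finally worked, written as two separate statements, can be sketched as follows. `all_comments` is a hypothetical helper name. As far as the PRAW documentation goes, `replace_more` changes the comment forest in place and does not hand the forest back, and `.list()` returns a plain Python list, so neither call can be chained onto the other. That would explain the failed one-line attempts. The documentation also describes `limit=0` as removing every placeholder without extra requests, `limit=None` as replacing all of them (one API request each), and the default as 32.

```python
def all_comments(submission, limit=0):
    """Resolve 'load more' placeholders, then return the flattened thread."""
    # Step 1: deal with the placeholders in place. With limit=0 they are
    # simply dropped; a larger limit fetches that many of them instead.
    submission.comments.replace_more(limit=limit)
    # Step 2: flatten the forest into one list, replies included.
    return submission.comments.list()
```

Usage would be along the lines of `for comment in all_comments(submission): print(comment.id, comment.body)`. On a very busy thread a high limit means many API requests, so the loop can run for a long time.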

00:12:47 So that's how you can iterate through

00:12:50 all the comments and all that. Now,

00:12:55 the question is, you know, how might you

00:12:57 rebuild that comment tree, right? Because

00:13:00 at some point, right, like, you've got to

00:13:02 rebuild that tree. So, for example, one

00:13:05 option you could have is, like, build a

00:13:08 dictionary or something like that, and

00:13:10 then each of, like, the, you know, like, the

00:13:12 parent... you've got a parent ID, and then

00:13:16 the parent content, and then all the

00:13:19 replies. So: parent ID, content, all

00:13:21 replies; parent ID, content, all replies.

00:13:23 And if you did that, you could rebuild

00:13:25 the tree yourself. Now, I'm not going to

00:13:27 go ahead and go through all that; I don't

00:13:29 really see too much point covering that

00:13:31 in video. But if you are interested in

00:13:33 that, you can go to part 2 of this

00:13:35 tutorial series on pythonprogramming.net,

00:13:38 and there'll be an example there if

00:13:39 you're interested in truly rebuilding

00:13:41 those comment trees. That's one way you

00:13:43 could do it; that's how I would do it,

00:13:45 anyway. If you have a better way... I'm sure

00:13:47 somebody could come up with a better way.
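
A minimal sketch of the dictionary idea is below. It is not the example from the written tutorial, just one possible shape. `build_tree` is a made-up name, and it works on plain `(parent_id, comment_id, body)` tuples, so it runs without Reddit access. It assumes parent IDs and comment IDs are in the same form, so any prefix such as `t1_` would need stripping first.

```python
from collections import defaultdict


def build_tree(rows, root_id):
    """Rebuild a nested comment tree from flat (parent_id, id, body) rows.

    root_id is the ID of the submission, i.e. the parent of every
    top-level comment.
    """
    # Group comments under the ID of whatever they replied to.
    children = defaultdict(list)
    for parent_id, comment_id, body in rows:
        children[parent_id].append((comment_id, body))

    def attach(parent_id):
        # Each node keeps its own ID, its content, and all of its replies.
        return [
            {"id": cid, "body": body, "replies": attach(cid)}
            for cid, body in children.get(parent_id, [])
        ]

    return attach(root_id)


rows = [
    ("s1", "a", "top-level comment"),
    ("a", "b", "reply to a"),
    ("b", "c", "reply to b"),
    ("s1", "d", "another top-level comment"),
]
tree = build_tree(rows, "s1")
```

Grouping by parent ID first means each comment is looked at once, and the order of the rows does not matter.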

00:13:49 Anyways, so now, in the next tutorial, what

00:13:53 we're going to talk about is basically

00:13:56 just streaming from Reddit. So this has

00:13:59 all been, like, historical grabbing from

00:14:01 Reddit, but there's also a way you can

00:14:03 actually just stream data from Reddit. So

00:14:05 anyways, that's what we're going to be doing in

00:14:06 the next tutorial. If you've got

00:14:07 questions, comments, concerns, whatever,

00:14:09 feel free to leave them below. Otherwise, I will

00:14:10 see you in the next tutorial.