Sunday, April 3, 2011

Re: [twitter-dev] need twitter spam for a research project

On Sun, 3 Apr 2011 18:19:38 -0700 (PDT), Jeff Tucker
<fred.f.chopin@gmail.com> wrote:
> I'm conducting a research project involving proactively identifying
> twitter spam accounts before they actually start spamming. I've
> observed that some spammers attempt to create tweets that look like
> they're a legitimate account prior to actually sending spam and my
> project is to be able to identify those accounts as soon as they pop
> up.
>
> Unfortunately (I can't believe that I'm writing this) I am having a
> hard time getting spammers to actually spam me. Is there any way that
> I can somehow get access to the tweets of several dozen spam accounts
> (prior to when they're shut down) so that I can see what they're
> posting? Is this possible somehow?
>
> Also, if anyone gets spammed regularly, are you interested in helping
> me out with my research? No guarantee that I'll actually publish
> this, but anyone interested will be credited in my paper in the
> acknowledgements. Thanks
> -Jeff Tucker
> Lecturer, DigiPen Institute of Technology
> www.digipen.edu

I don't know how rapidly Twitter detects and shuts spam accounts down
these days. I imagine there's a priority scheme, with accounts linking
to malware and pr0n shut down more aggressively than those that are just
"selling stuff" and being annoying about it. Here's a bit of pseudo-code
that will get you one class of spammers:

1. Poll the Trending Topics periodically. IIRC if you do it every ten
minutes for all the localities you won't use up all your API calls per
hour.

2. Do a search for each trending topic - take the first 100 tweets for
each. This doesn't count against your hourly API calls, since the
Search API is rate-limited separately.

3. Now use a relational database to find tweets that match more than
one trending topic. There's a high probability those are spam. Quite a
few of the other tweets will be spam too, but those that match multiple
trends are much more likely to be spam.

4. Now you have a list of accounts - pull each one's most recent tweets
(up to 3200, the most the timeline API will return) and test your
algorithm. You'll probably have to go through them manually to find the
boundary where the account started spamming, but then you should have a
nice dataset for training a classifier.
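
For step 3, a rough sketch of the multi-trend query in Python with the
standard-library sqlite3 module might look like the following. The table
layout, the function name multi_trend_tweets and the sample rows are all
made up for illustration; the Twitter fetching in steps 1 and 2 is left
out, and the input is assumed to be one (trend, tweet_id, screen_name)
tuple per search result.

```python
import sqlite3


def multi_trend_tweets(hits, min_trends=2):
    """Return (tweet_id, screen_name, trend_count) for tweets that showed
    up in the search results of at least `min_trends` different trends.

    `hits` is an iterable of (trend, tweet_id, screen_name) tuples.
    The schema below is hypothetical, not anything Twitter defines.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits ("
        "  trend TEXT, tweet_id INTEGER, screen_name TEXT,"
        "  UNIQUE (trend, tweet_id))"
    )
    # Repeated polling returns the same tweet again; ignore duplicates.
    conn.executemany("INSERT OR IGNORE INTO hits VALUES (?, ?, ?)", hits)
    rows = conn.execute(
        "SELECT tweet_id, screen_name, COUNT(DISTINCT trend) AS n "
        "FROM hits "
        "GROUP BY tweet_id, screen_name "
        "HAVING COUNT(DISTINCT trend) >= ? "
        "ORDER BY n DESC, tweet_id",
        (min_trends,),
    ).fetchall()
    conn.close()
    return rows


# Invented sample data: tweet 1 stuffs three trends into one message.
sample_hits = [
    ("#trendA", 1, "cheap_pills"),
    ("#trendB", 1, "cheap_pills"),
    ("#trendC", 1, "cheap_pills"),
    ("#trendA", 1, "cheap_pills"),  # duplicate from a later poll
    ("#trendA", 2, "alice"),
    ("#trendB", 3, "bob"),
]
suspects = multi_trend_tweets(sample_hits)
```

The screen names that come back are the candidate accounts for step 4.
Raising min_trends should give you fewer false positives at the cost of
missing some spammers.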


--
http://twitter.com/znmeb http://borasky-research.net

"A mathematician is a device for turning coffee into theorems." -- Paul
Erdős

--
Twitter developer documentation and resources: http://dev.twitter.com/doc
API updates via Twitter: http://twitter.com/twitterapi
Issues/Enhancements Tracker: http://code.google.com/p/twitter-api/issues/list
Change your membership to this group: http://groups.google.com/group/twitter-development-talk
