
Extracting Twitter Usertags using Regex

Take the following tweet as an example:

@shahmirj can be found on hello@example.com and @r2d2 but @000 dosent exist as a user

We need to be sure to match only @shahmirj and @r2d2 and leave out anything that starts with a number or is part of an email address. To do this we use the following regex:

(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)

The best way to understand the regex above is to start to the right of the @. Let's look at the meaning of @([A-Za-z]+[A-Za-z0-9_]+)

We have to make sure that anything we match starts with a letter, hence the [A-Za-z]+, and can then be followed by letters, numbers or underscores, hence the second expression [A-Za-z0-9_]+. This makes sure we match usernames such as @r2d2. We can't leave things there, because if you run this regex as it is you will also catch @example from the email address, which is not what we want. This is where the part before the @ sign comes in.
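
As a rough illustration of that problem (sketched in Python rather than PHP, with variable names made up for the example), running only the right-hand part against the example tweet picks up the email domain as well:

```python
import re

tweet = "@shahmirj can be found on hello@example.com and @r2d2 but @000 dosent exist as a user"

# Only the part to the right of the @ sign: nothing checks what comes before it.
handle_only = re.compile(r"@([A-Za-z]+[A-Za-z0-9_]+)")

print(handle_only.findall(tweet))
# ['shahmirj', 'example', 'r2d2']
```

@000 is already skipped because it starts with a digit, but 'example' slips through.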

Let's break (?<=^|(?<=[^a-zA-Z0-9-_\.])) down. Start with the inner part, (?<=[^a-zA-Z0-9-_\.]). It is a lookbehind that requires the character immediately before the @ sign to be something other than a letter, digit, hyphen, underscore or dot, so emails and tags such as aaa@bbb are ignored. However, if we use only this part, our first match @shahmirj disappears, because nothing at all comes before it. Therefore we add ^ as an alternative and combine the two into (?<=^|(?<=[^a-zA-Z0-9-_\.])), which in plain English means: match only when the @ is either at the very start of the text or comes right after a character, such as a space, that cannot be part of a word or email address.
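
To see the lookbehind at work, here is a minimal Python sketch of the full expression. The name mention_re and the short test strings are assumptions made for illustration; the pattern itself is the one from this post.

```python
import re

# Full pattern: the @ must be at the start of the text or follow a character
# that is not a letter, digit, hyphen, underscore or dot.
mention_re = re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)")

tweet = "@shahmirj can be found on hello@example.com and @r2d2 but @000 dosent exist as a user"
print(mention_re.findall(tweet))
# ['shahmirj', 'r2d2']

# A few short strings to show which preceding characters block a match.
for text in ("aaa@bbb", "(@bob)", ".@bob"):
    print(text, mention_re.findall(text))
# aaa@bbb []
# (@bob) ['bob']
# .@bob []
```

Both alternatives inside the outer lookbehind are zero-width, which is why Python's re module accepts it even though it normally insists on fixed-width lookbehinds.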

We can now put this into our PHP and change the tweet text to highlight the usernames by converting them into <a> tags. An example of this can be seen at http://www.shahmirj.com/twitter

$string = "@shahmirj can be found on hello@example.com and @r2d2 but @000 dosent exist as a user";
$regex = "/(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)/i";

// preg_replace() returns the new string, so print it (or assign it)
echo preg_replace($regex, "<a href='http://twitter.com/$1'>@$1</a>", $string);
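
For anyone not on PHP, the same replacement can be sketched in Python with re.sub. The helper name link_mentions is hypothetical, and the sketch assumes the tweet text has already been HTML-escaped where that matters.

```python
import re

# Same pattern as the PHP version; IGNORECASE mirrors the /i flag.
mention_re = re.compile(
    r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)",
    re.IGNORECASE,
)

def link_mentions(text):
    """Wrap each matched @username in an <a> tag pointing at its profile."""
    # \1 is the captured username, the equivalent of $1 in preg_replace.
    return mention_re.sub(r"<a href='http://twitter.com/\1'>@\1</a>", text)

tweet = "@shahmirj can be found on hello@example.com and @r2d2 but @000 dosent exist as a user"
print(link_mentions(tweet))
```

Only @shahmirj and @r2d2 are wrapped; the email address and @000 are left untouched.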

Note that this can also be used for hashtags; just change the @ symbol to #.

(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9_]+)
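
A quick Python sketch of the hashtag variant, using a made-up sample sentence, shows what it does and does not pick up. Like the username pattern, it requires the tag to start with a letter, so tags that begin with a digit are skipped.

```python
import re

# Hashtag variant: identical to the username pattern with @ swapped for #.
hashtag_re = re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9_]+)")

# Hypothetical sample text, not taken from a real tweet.
text = "Loving #regex and #php5, see page#2 and #2011"
print(hashtag_re.findall(text))
# ['regex', 'php5']
```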

If anyone has a better suggestion or spots something I missed, please leave a comment (Now Working!)

UPDATE

Fixed the issue where it wasn't picking up tags such as @shahmirj_. I needed to add _ to the end of the matching group.

By

18th Jun 2011
© 2011 Shahmir Javaid - http://shahmirj.com/blog/17

Slavisha

2nd Feb 2012

Hey man, this is awesome. Wonderful explanation!

BTW your website rocks as well.

Shahmir Javaid

5th Feb 2012

Thanks @Slavisha, The site is simple and no Photoshop used :D

Jenny

15th Nov 2012

hi Shahmir

Can you also add how to use regex for getting the Twitter username from any URL?

Like:

http://twitter.com/#username
https://twitter.com/username
http://www.twitter.com/@username

Shahmir Javaid

15th Nov 2012

It should be easy to create your own by using the above and prepending some of the examples from the following: http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url

Mike Stoddart

12th Feb 2014

Any chance you could port this to Python? I can't seem to get the right syntax..

Arash Vahabpour

11th Jan 2016

There are some replies like .@arash; you might want to consider them as well.

Shahmir Javaid

11th Jan 2016

@Mike It is pretty standard, there should be no difference in python. Let me know otherwise

Sreekanth Palaparty

5th Oct 2016

@Shahmir lookbehind is not supported in JavaScript; is there any alternative?

Shahmir Javaid

5th Oct 2016

@Sreekanth, not sure. Please post if you find a solution.

John Eddy

21st Jul 2017

fyi: "The best way to understand the regex above is to start at the right of @, lets understand the meaning of the following @([A-Za-z]+[A-Za-z0-9]+)" is missing the _

john blue

22nd Oct 2017

FYI, the comment about hashtags "Just a point of note that this can also be used for Hash Tags, just change the @ symbol to #.
(?<=^|(?<=[^a-zA-Z0-9-_\\.]))#([A-Za-z]+[A-Za-z0-9_]+)"

Using the Python function below, the regex will not pick up hashtags that start with a number, like #2017Predictions

import re

def extract_hashtags(s):
    hashtag_re = re.compile("(?<=^|(?<=[^a-zA-Z0-9-_\\.]))#([A-Za-z]+[A-Za-z0-9_]+)", re.UNICODE)
    return list(set(re.findall(hashtag_re, s.lower())))

And the regex for Twitter names will miss any that start with a number (e.g. https://twitter.com/25071956 )

def extract_atmentions(s):
    twitter_username_re = re.compile(r'(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9]+)')
    return list(set(re.findall(twitter_username_re, s.lower())))


Python 2.7.6 transcript demonstrating hashtags starting with numbers and Twitter names starting with numbers:

>>> a="RT @PulpLibrarian : #2017predictions : we'll need better #antibiotics ... https://t.co/q0kINkzwDt"
>>> def extract_hashtags(s):
...     hashtag_re = re.compile("(?<=^|(?<=[^a-zA-Z0-9-_\\.]))#([A-Za-z]+[A-Za-z0-9_]+)", re.UNICODE)
...     return list(set(re.findall(hashtag_re, s.lower())))
...
>>> extract_hashtags(a)
['antibiotics']
>>>


>>> a="RT @PulpLibrarian : @25071956 #2017predictions : we'll need better #antibiotics ... https://t.co/q0kINkzwDt"
>>> def extract_atmentions(s):
...     twitter_username_re = re.compile(r'(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9]+)')
...     return list(set(re.findall(twitter_username_re, s.lower())))
...
>>> extract_atmentions(a)
['pulplibrarian']
>>>

john blue

22nd Oct 2017

An update found on StackExchange http://shahmirj.com/blog/extracting-twitter-usertags-using-regex, where you also posted info. A comment by rokh (Aug 16 '15 at 8:11) shares this:

"@000 should be catched either as well as screen names with underscore (as mentioned by @backslash17 and @fixxxer). And it will not catch hashtags just by simply raplacing @ with #, since hashtags can contains unicode as well. So the expression for mentions should be (?<=^|(?<=[^a-zA-Z0-9-\.]))@([A-Za-z0-9_]+) – rokh Aug 16 '15 at 8:11"

So the modified Python functions would be:

import re

def extract_atmentions(s):
    twitter_username_re = re.compile(r'(?<=^|(?<=[^a-zA-Z0-9-\.]))@([A-Za-z0-9_]+)')
    return list(set(re.findall(twitter_username_re, s.lower())))

def extract_hashtags(s):
    hashtag_re = re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-\.]))#([A-Za-z0-9_]+)", re.UNICODE)
    return list(set(re.findall(hashtag_re, s.lower())))

Python transcript:

>>> a=" @25071956 : RT @CristinaDaRold : Too few #antibiotics in pipeline to tackle global #drug -resistance crisis, WHO warns https://t.co/KGNSBbdDHN Wed Sep 20 21:15:57 +0000 2017 #2017prediction #fun @_tgfb"
>>> def extract_hashtags(s):
...     hashtag_re = re.compile("(?<=^|(?<=[^a-zA-Z0-9-\.]))#([A-Za-z0-9_]+)", re.UNICODE)
...     return list(set(re.findall(hashtag_re, s.lower())))
...
>>> extract_hashtags(a)
['fun', 'antibiotics', '2017prediction', 'drug']
>>>
>>> def extract_atmentions(s):
...     twitter_username_re = re.compile(r'(?<=^|(?<=[^a-zA-Z0-9-\.]))@([A-Za-z0-9_]+)')
...     return list(set(re.findall(twitter_username_re, s.lower())))
...
>>> a
' @25071956 : RT @CristinaDaRold : Too few #antibiotics in pipeline to tackle global #drug -resistance crisis, WHO warns https://t.co/KGNSBbdDHN Wed Sep 20 21:15:57 +0000 2017 #2017prediction #fun @_tgfb'
>>> extract_atmentions(a)
['_tgfb', '25071956', 'cristinadarold']
>>>

This post still helped :) Understanding regex is not easy when you don't use it all the time.


