Extracting Twitter Usertags using Regex

Take the following tweet as an example:

@shahmirj can be found on hello@example.com and @r2d2 but @000 doesn't exist as a user

We need to be sure to match only @shahmirj and @r2d2, and to leave out anything that starts with a number or is an email address. To do this we use the following regex:

(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)

The best way to understand the regex above is to start with the part to the right of the @ sign: @([A-Za-z]+[A-Za-z0-9_]+)

We have to make sure that anything we match starts with a letter, hence the [A-Za-z]+, which can then be followed by letters, numbers, or underscores, hence the second expression [A-Za-z0-9_]+. This makes sure we match user-names such as @r2d2. Note that the two + quantifiers together require a name of at least two characters. We can't leave things there, because if you run this regex as it is you will also catch @example from the email address, which is not what we want. This is where the part before the @ sign comes in.
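
To see the problem in practice, here is a minimal sketch that runs only the right-hand part against the example tweet. It uses Python's re module rather than the PHP used in this post, and the variable names are purely illustrative.

```python
import re

tweet = "@shahmirj can be found on hello@example.com and @r2d2 but @000 doesn't exist as a user"

# Only the part to the right of the @ sign, with no check on what comes before it.
name_only = re.compile(r"@([A-Za-z]+[A-Za-z0-9_]+)")

matches = name_only.findall(tweet)
print(matches)  # ['shahmirj', 'example', 'r2d2']
```

The unwanted "example" comes from hello@example.com, while @000 is already rejected because it does not start with a letter.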

Let's break (?<=^|(?<=[^a-zA-Z0-9-_\.])) down. Start with the inner, right-hand part, (?<=[^a-zA-Z0-9-_\.]). It is a lookbehind that makes sure the character just before the @ sign is not a letter, number, hyphen, underscore, or dot, so emails or tags such as aaa@bbb are ignored. However, if we only use this part, then our first match @shahmirj disappears, as nothing at all comes before it. Therefore we add the alternative ^ in front and combine it all together as (?<=^|(?<=[^a-zA-Z0-9-_\.])), which in plain English translates to: match only if the @ is either at the start of the text or follows a character, such as a space, that cannot be part of a name or an email address.
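
As a rough check of the full expression, the sketch below runs it against the example tweet. Again this is Python's re module standing in for PHP, and the names tweet and mention are just illustrative. Python only allows fixed-width lookbehinds, which this pattern should satisfy because both alternatives inside the outer lookbehind are zero-width.

```python
import re

tweet = "@shahmirj can be found on hello@example.com and @r2d2 but @000 doesn't exist as a user"

# The lookbehind accepts either the start of the string or a preceding
# character that is not a letter, digit, hyphen, underscore or dot.
mention = re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)")

names = mention.findall(tweet)
print(names)  # ['shahmirj', 'r2d2']
```

The email address is skipped because its @ follows the letter "o", and @000 is skipped because the name does not start with a letter.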

We can now combine this into our PHP and change the tweet text to highlight the user-names by converting them into <a> tags. An example of this can be seen at http://www.shahmirj.com/twitter

$string = "@shahmirj can be found on hello@example.com and @r2d2 but @000 doesn't exist as a user";
$regex = "/(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)/i";

// Wrap every matched user-name in a link; $1 is the captured name without the @.
echo preg_replace($regex, "<a href='http://twitter.com/$1'>@$1</a>", $string);
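
For readers working in Python, a roughly equivalent replacement might look like the sketch below. It is an assumed port, not part of the original PHP: re.sub plays the role of preg_replace, and \1 plays the role of $1. The case-insensitive flag is left out because the character classes already cover both cases.

```python
import re

tweet = "@shahmirj can be found on hello@example.com and @r2d2 but @000 doesn't exist as a user"
mention = re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)")

# \1 is the captured user-name without the @ sign.
linked = mention.sub(r"<a href='http://twitter.com/\1'>@\1</a>", tweet)
print(linked)
```

Only @shahmirj and @r2d2 are turned into links; the email address and @000 are left as plain text.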

Note that this can also be used for hashtags: just change the @ symbol to #.

(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9_]+)
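
A small sketch of the hashtag variant, again in Python rather than PHP. The sample text is made up for illustration and is not from the post.

```python
import re

# Hypothetical sample text: two real hashtags and two # signs inside words.
text = "#regex notes for #php_tips, not page#2 or mail#list"

hashtag = re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9_]+)")

tags = hashtag.findall(text)
print(tags)  # ['regex', 'php_tips']
```

As with user-names, a # that directly follows a letter is ignored, and so is any tag that starts with a number.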

If anyone has a better suggestion, or spots something I missed, please leave a comment (Now Working!)

UPDATE

Fixed the issue where it wasn't picking up tags such as @shahmirj_. I needed to add _ to the end of the matching group.

By Shahmir Javaid

18th Jun 2011

© 2011 Shahmir Javaid - http://shahmirj.com/blog/17

Slavisha

2nd Feb 2012

Hey man, this is awesome. Wonderful explanation!

BTW your website rocks as well.

Shahmir Javaid

5th Feb 2012

Thanks @Slavisha. The site is simple and no Photoshop was used :D

Jenny

15th Nov 2012

Hi Shahmir,

Can you also add how to use a regex to get the Twitter username from any URL,

like

http://twitter.com/#username
https://twitter.com/username
http://www.twitter.com/@username

Shahmir Javaid

15th Nov 2012

It should be easy to create your own by using the above and prepending some of the examples from the following: http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url

Mike Stoddart

12th Feb 2014

Any chance you could port this to Python? I can't seem to get the right syntax..

Arash Vahabpour

11th Jan 2016

There are some replies like .@arash; you might consider them as well.

Shahmir Javaid

11th Jan 2016

@Mike It is pretty standard, there should be no difference in python. Let me know otherwise

Sreekanth Palaparty

5th Oct 2016

@Shahmir, lookbehind is not supported in JavaScript. Is there any alternative?

Shahmir Javaid

5th Oct 2016

@Sreekanth, not sure. Please post if you find a solution.
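
One possible workaround, sketched here in Python rather than JavaScript, is to drop the lookbehind and capture the preceding character in a group of its own. The pattern below is an assumption and not the regex from the post. It uses only anchors, character classes, and capture groups, so it should carry over to engines without lookbehind, though that has not been checked here.

```python
import re

tweet = "@shahmirj can be found on hello@example.com and @r2d2 but @000 doesn't exist as a user"

# Group 1 consumes the start of the string or the preceding character,
# instead of checking it with a lookbehind. Group 2 is the user-name.
portable = re.compile(r"(^|[^A-Za-z0-9_.-])@([A-Za-z]+[A-Za-z0-9_]+)")

names = [m.group(2) for m in portable.finditer(tweet)]
print(names)  # ['shahmirj', 'r2d2']

# When replacing, group 1 has to be written back in front of the link.
linked = portable.sub(r"\1<a href='http://twitter.com/\2'>@\2</a>", tweet)
print(linked)
```

The trade-off is that the match now includes the preceding character, so any replacement has to put it back, as the \1 above does.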

John Eddy

21st Jul 2017

fyi: "The best way to understand the regex above is to start at the right of @, lets understand the meaning of the following @([A-Za-z]+[A-Za-z0-9]+)" is missing the _

john blue

22nd Oct 2017

FYI, the comment about hashtags "Just a point of note that this can also be used for Hash Tags, just change the @ symbol to #.
(?<=^|(?<=[^a-zA-Z0-9-_\\.]))#([A-Za-z]+[A-Za-z0-9_]+)"

Using the Python function below, the regex will not pick up hashtags that start with a number, like #2017Predictions

import re

def extract_hashtags(s):
    hashtag_re = re.compile("(?<=^|(?<=[^a-zA-Z0-9-_\\.]))#([A-Za-z]+[A-Za-z0-9_]+)", re.UNICODE)
    return list(set(re.findall(hashtag_re, s.lower())))

And the regex for Twitter names will miss any that start with a number (e.g. https://twitter.com/25071956)

import re

def extract_atmentions(s):
    twitter_username_re = re.compile(r'(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9]+)')
    return list(set(re.findall(twitter_username_re, s.lower())))

Python 2.7.6 transcript demonstrating hashtags starting with numbers and Twitter names starting with numbers:

>>> a="RT @PulpLibrarian : #2017predictions : we'll need better #antibiotics ... https://t.co/q0kINkzwDt"
>>> def extract_hashtags(s):
...     hashtag_re = re.compile("(?<=^|(?<=[^a-zA-Z0-9-_\\.]))#([A-Za-z]+[A-Za-z0-9_]+)", re.UNICODE)
...     return list(set(re.findall(hashtag_re, s.lower())))
...
>>> extract_hashtags(a)
['antibiotics']

>>> a="RT @PulpLibrarian : @25071956 #2017predictions : we'll need better #antibiotics ... https://t.co/q0kINkzwDt"
>>> def extract_atmentions(s):
...     twitter_username_re = re.compile(r'(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9]+)')
...     return list(set(re.findall(twitter_username_re, s.lower())))
...
>>> extract_atmentions(a)
['pulplibrarian']

john blue

22nd Oct 2017

An update found on StackExchange http://shahmirj.com/blog/extracting-twitter-usertags-using-regex, where you also posted this info. A comment by rokh (Aug 16 '15 at 8:11) shares this:

"@000 should be catched either as well as screen names with underscore (as mentioned by @backslash17 and @fixxxer). And it will not catch hashtags just by simply raplacing @ with #, since hashtags can contains unicode as well. So the expression for mentions should be (?<=^|(?<=[^a-zA-Z0-9-\.]))@([A-Za-z0-9_]+) – rokh Aug 16 '15 at 8:11"

So the modified Python functions would be:

import re

def extract_atmentions(s):
    twitter_username_re = re.compile(r'(?<=^|(?<=[^a-zA-Z0-9-\.]))@([A-Za-z0-9_]+)')
    return list(set(re.findall(twitter_username_re, s.lower())))

def extract_hashtags(s):
    hashtag_re = re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-\.]))#([A-Za-z0-9_]+)", re.UNICODE)
    return list(set(re.findall(hashtag_re, s.lower())))

Python transcript:

>>> a=" @25071956 : RT @CristinaDaRold : Too few #antibiotics in pipeline to tackle global #drug -resistance crisis, WHO warns https://t.co/KGNSBbdDHN Wed Sep 20 21:15:57 +0000 2017 #2017prediction #fun @_tgfb"
>>> def extract_hashtags(s):
...     hashtag_re = re.compile("(?<=^|(?<=[^a-zA-Z0-9-\.]))#([A-Za-z0-9_]+)", re.UNICODE)
...     return list(set(re.findall(hashtag_re, s.lower())))
...
>>> extract_hashtags(a)
['fun', 'antibiotics', '2017prediction', 'drug']
>>> def extract_atmentions(s):
...     twitter_username_re = re.compile(r'(?<=^|(?<=[^a-zA-Z0-9-\.]))@([A-Za-z0-9_]+)')
...     return list(set(re.findall(twitter_username_re, s.lower())))
...
>>> a
' @25071956 : RT @CristinaDaRold : Too few #antibiotics in pipeline to tackle global #drug -resistance crisis, WHO warns https://t.co/KGNSBbdDHN Wed Sep 20 21:15:57 +0000 2017 #2017prediction #fun @_tgfb'
>>> extract_atmentions(a)
['_tgfb', '25071956', 'cristinadarold']

This post still helped :) Understanding regex is not easy when it is not used all the time.
