Take the following tweet as an example:
@shahmirj can be found on hello@example.com and @r2d2 but @000 dosent exist as a user
We need to be sure to match only @shahmirj and @r2d2, and leave out anything that starts with a number or is an email address. To do this we use the following regex:
(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)
The best way to understand the regex above is to start to the right of the @ sign. Let's look at what @([A-Za-z]+[A-Za-z0-9_]+) means. Anything we match has to start with a letter, hence the [A-Za-z]+, but it can then be followed by letters, numbers or underscores, hence the following expression [A-Za-z0-9_]+. This makes sure we match usernames such as @r2d2. We can't leave things there, because if you run this regex as it is you will also end up catching @example, which is not what we want. This is where the part before the @ sign comes in.
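To see the problem concretely, here is a minimal sketch in Python (the post itself uses PHP; the variable names are only for illustration). It runs just the username part of the pattern against the example tweet, with no check on what comes before the @ sign.

```python
import re

tweet = ("@shahmirj can be found on hello@example.com and @r2d2 "
         "but @000 dosent exist as a user")

# Only the part to the right of the @ sign, nothing guarding what precedes it.
bare = re.compile(r"@([A-Za-z]+[A-Za-z0-9_]+)")

matches = bare.findall(tweet)
print(matches)  # ['shahmirj', 'example', 'r2d2']
```

The unwanted 'example' comes from the email address, while @000 is already rejected because it does not start with a letter.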
Let's break (?<=^|(?<=[^a-zA-Z0-9-_\.])) down. Start with the inner, right-hand part, (?<=[^a-zA-Z0-9-_\.]). It is a lookbehind that requires the character immediately before the @ sign to be something other than a letter, digit, hyphen, underscore or dot, so emails or tags such as aaa@bbb are ignored. However, if we only use this part then our first match, @shahmirj, disappears, because it has no character before it at all. Therefore we put the alternative ^ (the start of the text) in front of it and combine it all together as (?<=^|(?<=[^a-zA-Z0-9-_\.])). In plain English this means: match an @ that is either at the very start or comes right after a character, such as a space, that cannot be part of a username or email address.
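As a rough illustration, the same idea can be sketched in Python. Note that this sketch swaps in a negative lookbehind, (?<!...), rather than the exact expression from the post: "not preceded by a letter, digit, hyphen, underscore or dot" is also true at the start of the text, so it should accept and reject the same positions. The variable names are assumptions for the example.

```python
import re

tweet = ("@shahmirj can be found on hello@example.com and @r2d2 "
         "but @000 dosent exist as a user")

# Negative lookbehind: the @ must NOT be preceded by a letter, digit,
# hyphen, underscore or dot. This also holds at the start of the string.
mention = re.compile(r"(?<![A-Za-z0-9\-_.])@([A-Za-z]+[A-Za-z0-9_]+)")

usernames = mention.findall(tweet)
print(usernames)  # ['shahmirj', 'r2d2']
```

The email address no longer produces a match, because its @ is preceded by the letter "o".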
We can now combine this into our PHP and change the Twitter text to highlight the usernames and convert them into <a> tags. An example of this can be seen at http://www.shahmirj.com/twitter
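$string = "@shahmirj can be found on hello@example.com and @r2d2 but @000 dosent exist as a user";
// Same regex as above, wrapped in / delimiters; $1 is the captured username.
$regex = '/(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9_]+)/i';
// preg_replace returns the new string, so print (or store) the result.
echo preg_replace($regex, "<a href='http://twitter.com/$1'>@$1</a>", $string);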
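For readers working in Python, a comparable sketch might look like the following. It is not the post's PHP code: the function name link_mentions is made up for this example, and the pattern uses a negative lookbehind in place of the post's lookbehind-with-alternation, which should select the same mentions.

```python
import re

# @ not preceded by a letter, digit, hyphen, underscore or dot,
# followed by a username that starts with a letter.
MENTION = re.compile(r"(?<![A-Za-z0-9\-_.])@([A-Za-z]+[A-Za-z0-9_]+)")

def link_mentions(text):
    """Wrap each @username in an <a> tag pointing at its Twitter profile."""
    return MENTION.sub(r"<a href='http://twitter.com/\1'>@\1</a>", text)

linked = link_mentions("@shahmirj can be found on hello@example.com and @r2d2")
print(linked)
```

Here \1 plays the role of $1 in the PHP replacement string.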
Just a point of note: this can also be used for hashtags, just change the @ symbol to #.
(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9_]+)
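A small Python sketch of the hashtag variant, again using a negative lookbehind as a stand-in for the expression above; the sample text and names are invented for the example. Like the username pattern, it requires a leading letter, so a tag that begins with a digit is skipped.

```python
import re

# Same shape as the mention pattern, with # in place of @.
HASHTAG = re.compile(r"(?<![A-Za-z0-9\-_.])#([A-Za-z]+[A-Za-z0-9_]+)")

text = "Learning #regex in #PHP5, see page#top and #2017Predictions"
tags = HASHTAG.findall(text)
print(tags)  # ['regex', 'PHP5']
```

"page#top" is ignored because the # follows a letter, and "#2017Predictions" is ignored because it starts with a digit.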
If anyone has a better suggestion, or spots something I missed, please leave a comment (now working!).
Fixed the issue where it wasn't picking up tags such as @shahmirj_. Needed to add _ at the end of the matching group.
5th Feb 2012
Thanks @Slavisha, the site is simple and no Photoshop was used :D
15th Nov 2012
It should be easy to create your own by using the above and prepending some of the examples from the following: http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url
11th Jan 2016
@Mike It is pretty standard, there should be no difference in python. Let me know otherwise
5th Oct 2016
@Sreekanth, not sure. Please post if you find a solution.
FYI, regarding the comment about hashtags, "Just a point of note that this can also be used for Hash Tags, just change the @ symbol to #.
(?<=^|(?<=[^a-zA-Z0-9-_\\.]))#([A-Za-z]+[A-Za-z0-9_]+)":
Using the Python function below, the regex will not pick up hashtags that start with a number, like #2017Predictions.
import re

def extract_hashtags(s):
    # Raw string so the backslash reaches the regex engine unchanged.
    hashtag_re = re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9_]+)", re.UNICODE)
    return list(set(re.findall(hashtag_re, s.lower())))
And the regex for Twitter names will miss any that start with a number (e.g. https://twitter.com/25071956).
def extract_atmentions(s):
    twitter_username_re = re.compile(r'(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9]+)')
    return list(set(re.findall(twitter_username_re, s.lower())))
Python 2.7.6 transcript demonstrating hashtags starting with numbers and Twitter names starting with numbers:
>>> a="RT @PulpLibrarian : #2017predictions : we'll need better #antibiotics ... https://t.co/q0kINkzwDt"
>>> def extract_hashtags(s):
...     hashtag_re = re.compile("(?<=^|(?<=[^a-zA-Z0-9-_\\.]))#([A-Za-z]+[A-Za-z0-9_]+)", re.UNICODE)
...     return list(set(re.findall(hashtag_re, s.lower())))
...
>>> extract_hashtags(a)
['antibiotics']
>>>
>>> a="RT @PulpLibrarian : @25071956 #2017predictions : we'll need better #antibiotics ... https://t.co/q0kINkzwDt"
>>> def extract_atmentions(s):
...     twitter_username_re = re.compile(r'(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9]+)')
...     return list(set(re.findall(twitter_username_re, s.lower())))
...
>>> extract_atmentions(a)
['pulplibrarian']
>>>
An update found on StackExchange http://shahmirj.com/blog/extracting-twitter-usertags-using-regex, where you also posted info. A comment by rokh (Aug 16 '15 at 8:11) shares this:
"@000 should be catched either as well as screen names with underscore (as mentioned by @backslash17 and @fixxxer). And it will not catch hashtags just by simply raplacing @ with #, since hashtags can contains unicode as well. So the expression for mentions should be (?<=^|(?<=[^a-zA-Z0-9-\.]))@([A-Za-z0-9_]+) – rokh Aug 16 '15 at 8:11"
So the modified Python functions would be:
import re

def extract_atmentions(s):
    twitter_username_re = re.compile(r'(?<=^|(?<=[^a-zA-Z0-9-\.]))@([A-Za-z0-9_]+)')
    return list(set(re.findall(twitter_username_re, s.lower())))

def extract_hashtags(s):
    hashtag_re = re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-\.]))#([A-Za-z0-9_]+)", re.UNICODE)
    return list(set(re.findall(hashtag_re, s.lower())))
Python transcript:
>>> a=" @25071956 : RT @CristinaDaRold : Too few #antibiotics in pipeline to tackle global #drug -resistance crisis, WHO warns https://t.co/KGNSBbdDHN Wed Sep 20 21:15:57 +0000 2017 #2017prediction #fun @_tgfb"
>>> def extract_hashtags(s):
...     hashtag_re = re.compile("(?<=^|(?<=[^a-zA-Z0-9-\.]))#([A-Za-z0-9_]+)", re.UNICODE)
...     return list(set(re.findall(hashtag_re, s.lower())))
...
>>> extract_hashtags(a)
['fun', 'antibiotics', '2017prediction', 'drug']
>>>
>>> def extract_atmentions(s):
...     twitter_username_re = re.compile(r'(?<=^|(?<=[^a-zA-Z0-9-\.]))@([A-Za-z0-9_]+)')
...     return list(set(re.findall(twitter_username_re, s.lower())))
...
>>> a
' @25071956 : RT @CristinaDaRold : Too few #antibiotics in pipeline to tackle global #drug -resistance crisis, WHO warns https://t.co/KGNSBbdDHN Wed Sep 20 21:15:57 +0000 2017 #2017prediction #fun @_tgfb'
>>> extract_atmentions(a)
['_tgfb', '25071956', 'cristinadarold']
>>>
This post still helped :) Understanding regex is not easy when you don't use it all the time.
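rokh's comment quoted above also points out that hashtags can contain Unicode characters, which the [A-Za-z0-9_] classes do not cover. One possible way to handle that, sketched here under the assumption of Python 3 (where \w is Unicode-aware for text patterns), is to use \w for the tag and a negative lookbehind for the boundary. This is an illustration only, not the pattern from the post or from Twitter itself, and the sample string is made up.

```python
import re

# \w matches letters, digits and underscore; for str patterns in Python 3
# it is Unicode-aware by default, so accented letters are included.
HASHTAG = re.compile(r"(?<![\w.\-])#(\w+)")

def extract_hashtags(s):
    """Return the unique hashtags in s, lower-cased and sorted."""
    return sorted(set(HASHTAG.findall(s.lower())))

# "\u00e9" is the letter e with an acute accent, as in "Café".
sample = "#2017prediction #Caf\u00e9 #fun mail#tag"
result = extract_hashtags(sample)
print(result)
```

Sorting the result gives a stable order, which the list(set(...)) versions above do not guarantee.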