Using R, JSON and Regex without leaving Management Studio in SQL Server 2017

SQL Server and Regular Expression may finally have settled their differences.

It is now possible to use Regular Expressions without leaving Management Studio.

I personally have wanted this functionality for a long time. Text-mining has always been a huge part of my job and regular expressions (in spite of the very true cartoon above) are a pretty useful tool. I’d sat at SQL Bits and been excited by the announcement that R and Python were coming to SQL Server via EXEC
sp_execute_external_script
.
When the time was right I would use R to write Regular Expressions in SQL.

The time became right this week which meant I had the joy of letting someone else get there  first. That someone was Jeffrey Yao and I used his excellent blog to get started. The blog really is good but it didn’t go as far as I wanted; namely it didn’t show how to extract multiple matches of, say, phone numbers.

Here is Mr Yao’s code with one adaptation that I have noted.

USE TestDB;
— assume this is our testing database

GO

IF
OBJECT_ID(‘dbo.tblRegEx’,
‘U’)
IS
NOT
NULL


DROP
TABLE dbo.tblRegEx;

CREATE
TABLE dbo.tblRegEx

(

id INT
IDENTITY,

a VARCHAR(300),

b VARCHAR(300)

);

GO

INSERT
INTO dbo.tblRegEx

(

a,

b

)

VALUES

(‘hello hello hello world’,
‘my phone number is 321-111-1111’),

(‘this this is is really fun’,
‘this number (604) 991-9111 is my cell phone; (704) 991-9111 is my other cell phone’),    –only Difference to Jeffrey Yao blog is the extra phone number

(‘no duplicate here’,
‘no phone number here either, just my license# 111 111 2222’),

(‘multiple blank lines

–this is 6th line’,
‘222 333-4444 is my cell phone’);

My goal was to get a list of phone numbers in a relational table that led back to tblRegex:

This was how I did it:

–usage example

DECLARE @Json NVARCHAR(MAX);

EXEC dbo.usp_Table_RegexExtract
@PKField =
‘id’,

@InputField =
‘b’,
–name of field with scrapable data

@Table =
‘dbo.tblRegex’,

@RegexPattern =
‘(\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b)’,

@json = @Json OUTPUT;

SELECT
*

FROM


OPENJSON(@Json)


WITH


(

MyID INT
‘strict $.LocalID’,

Input NVARCHAR(MAX)
‘$.InputFieldValue’,

OrderInd INT
‘$.OrderInd’,

ExtractedValue NVARCHAR(MAX)
‘$.FinalExtraction’


);

These three lines of code contain: R script, box-fresh SQL Server features like JSON and Regular Expressions and this, in the SQL Server world is pretty darn sexy. (The converse of this statement is also true.)

Let’s take a look under the hood.

CREATE
PROCEDURE dbo.usp_Table_RegexExtract

@PKField NVARCHAR(128),
–name of primary key field

@InputField NVARCHAR(128),
–name of field containing data to be scraped

@Table NVARCHAR(128),

@RegexPattern NVARCHAR(4000)
=
‘(\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b)’,

@json NVARCHAR(MAX)
=
NULL
OUTPUT

AS

/*

extracts values from a field based on the regex pattern provided and returns the [nested] output as JSON

*/

SET
NOCOUNT
ON;

BEGIN


DECLARE @SQLStatement NVARCHAR(4000)
=
N’SELECT #PKField as id, #InputField AS b from #Table’;


SELECT @SQLStatement =
REPLACE(@SQLStatement,
‘#PKField’, @PKField),

@SQLStatement =
REPLACE(@SQLStatement,
‘#InputField’, @InputField),

@SQLStatement =
REPLACE(@SQLStatement,
‘#Table’, @Table);


DECLARE @RStatement NVARCHAR(4000)

The beginning of the stored procedure simply takes in a table name, a field name and a primary key field; it then puts them together to form a simple select statement; for example: SELECT id, b FROM dbo.tblRegEx
; it also takes in a regular expression pattern, in this case the one for a US telephone number.

I used this SQL Statement to get the column (vector if you will) of data to be interrogated. The next part is where the magic happens:

DECLARE @RStatement NVARCHAR(4000)


=
N’

pattern = “‘
+ @RegexPattern


+
‘”;

m <- gregexpr(pattern, inData$b, ignore.case = T, perl = T); #NB ignore case

n <- rep(NA,length(inData$b));

n <- sapply(regmatches(inData$b, m), paste, collapse=”|”);    #NB sapply here I found @dizzy_pete wrote a good blog http://petewerner.blogspot.co.uk/2012/12/using-apply-sapply-lapply-in-r.html

inData$c <- n;

outData <- inData;

outData;’;

The variable @RStatement is being prepared specifically to take advantage of sp_execute_external_script and whilst the differences between Mr Yao and my scripts are small they are important (and actually took me some time to get to). The first is that, in order to get multiple matches we need to use gregexpr (with a g at the front). This leads to the next conundrum, how to deal with nested vectors. After a few false starts, I ended up using sapply and using a pipe to separate multiple entries (for more information please see the blog linked to in the code).

This was enough R code to go straight to execution but first I needed to build a table variable to hold the output:

DECLARE @RegExOut TABLE


(

LocalID INT
PRIMARY
KEY
NOT
NULL,

InputFieldValue NVARCHAR(MAX),

InitalExtraction NVARCHAR(MAX)


);


INSERT
INTO @RegExOut


(

LocalID,

InputFieldValue,

InitalExtraction


)


EXEC
sp_execute_external_script
@language =
N’R’,

@script = @RStatement,

@input_data_1 = @SQLStatement,

@input_data_1_name =
N’inData’,

@output_data_1_name =
N’outData’;

This meant that SQL had an object in which to store the regular expression matches. The next challenge was to return these to the user. That is to use our output variable.

I did this by invoking another long awaited SQL function STRING_SPLIT
, culminating the code with FOR
JSON
PATH

and assigning this to our output variable.


SELECT @json =


(


SELECT rxo.LocalID,

rxo.InputFieldValue,

rxo.InitalExtraction,


ROW_NUMBER()
OVER(PARTITION
BY rxo.LocalID ORDER
BY rxo.LocalID)
AS OrderInd,

extraction.Value
AS FinalExtraction


FROM @RegExOut rxo


CROSS
APPLY
STRING_SPLIT(rxo.InitalExtraction,
‘|’) extraction


WHERE extraction.Value
IS
NOT
NULL


FOR
JSON
PATH


);

And that was that. Now I just needed to use it, which I did like so:

–usage example

DECLARE @Json NVARCHAR(MAX);

EXEC dbo.usp_Table_RegexExtract
@PKField =
‘id’,

@InputField =
‘b’,
–name of field containing data to be scraped

@Table =
‘dbo.tblRegex’,

@RegexPattern =
‘(\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b)’,

@json = @Json OUTPUT;

SELECT
*

FROM


OPENJSON(@Json)


WITH


(

MyID INT
‘strict $.LocalID’,

Input NVARCHAR(MAX)
‘$.InputFieldValue’,

OrderInd INT
‘$.OrderInd’,

ExtractedValue NVARCHAR(MAX)
‘$.FinalExtraction’


);

SQL, JSON, R and Regex all in three lines. Hurrah!