That’s What He Said

Using Regular Expressions With Yahoo Pipes

Posted in Programming, Technology by wolfsbayne on January 11, 2009
yahoo pipes


Occasionally, you’ll want to manipulate the data you’re working with in Yahoo Pipes. Utilizing Regular Expressions can make the task fairly simple. Here are some great references I found while recently working on such a task. Enjoy! Oh, and if you find any errors, please let me know so I can update this document. **

++ Regular Expressions in Yahoo Pipes

* [ official Yahoo Pipes RegEx documentation]

+++ The basics

The RegEx module is one of the most powerful modules in Yahoo Pipes. You can do all kinds of data transformations with it. This wiki page gives you a short overview.

**Please note**: Like in the Yahoo Pipes discussions, I put RegEx patterns within square brackets. That way, you can distinguish for example [] and [ ] easily. Please omit the square brackets unless noted otherwise.

+++ The Modifiers

You might have noticed four checkboxes next to each RegEx line. They modify the way the RegEx behaves and succeeded the so-called “embedded pattern-match modifiers”, but they did not completely replace them.

In fact, you can use the modifiers “I, M, S and X” in embedded notation, while the checkboxes offer the options “I, M, S and G”. So there’s no X for the checkboxes and no G for the embedded notation.

The answer is taken from the [ Yahoo Pipes Discussions] and updated with information from [ RegEx documentation started].

**What they do**

* **g** allow global matches. set=match every occurrence; unset=match only first occurrence.
* **i** be case insensitive. set=‘A’ equals ‘a’; unset=‘A’ and ‘a’ are treated differently.
* **m** treat string as multiple lines. set=‘^’ matches at the start of every line, that is at the start of the string and after each \n and/or \r. unset=‘^’ matches only at the very start of the string.
* **s** allow ‘.’ to match new lines as well. set=‘.’ matches ‘\n’. unset=‘.’ does not match ‘\n’.
* **x** allow white spaces and comments within an expression.
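
To try the modifiers outside of Pipes, here is a rough sketch using Python's `re` module. It is only an approximation: Pipes runs its own RegEx engine, the sample string is made up, and Python has no g flag. Instead, `re.sub` replaces every match by default, and `count=1` stands in for an unticked g.

```python
import re

text = "Foo\nfoo"

# g unset vs. set: count=1 replaces only the first match, the default replaces all
first_only = re.sub(r"o", "0", text, count=1)
every_match = re.sub(r"o", "0", text)

# i: case-insensitive matching
ignore_case = re.findall(r"foo", text, flags=re.IGNORECASE)

# m: '^' also matches right after each newline
line_starts = re.findall(r"^\w", text, flags=re.MULTILINE)

# s: '.' also matches the newline character
dot_all = re.fullmatch(r".*", text, flags=re.DOTALL) is not None
dot_default = re.fullmatch(r".*", text) is not None

# x: whitespace and comments inside the pattern are ignored
verbose = re.findall(r"""
    f   # a literal f
    oo  # followed by two o's
""", text, flags=re.VERBOSE)
```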

**Embedded Notation**

[ Hapdaniel] from the Yahoo Pipes Discussions points out that the original form of specifying those flags is the “embedded notation”. If you prefix your RegEx with (?x), you set the X modifier. You cannot set a (?g) that way, though.
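
Python's `re` module happens to accept the same embedded notation, so a small sketch can show the idea. The sample strings are made up, and Pipes may differ in details.

```python
import re

# (?i) at the start of the pattern switches on case-insensitivity
inline_i = re.findall(r"(?i)pipes", "Pipes and PIPES")

# (?x) switches on extended mode: whitespace and comments are ignored
inline_x = re.findall(r"(?x) \d+  # one or more digits", "a1 b22")

# There is no inline "g" here either: findall and sub are global by default
```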

**Checkbox Notation**

To activate one of the checkbox flags, just tick it. You can tick as many flags as you like. The exception is the X flag, which apparently is not available as a checkbox.

+++ Common patterns

**Matching empty**

What if you want to match “nothing”? [ Hapdaniel] has the solution:

* [^(?!.)]

**Matching not empty**

And here’s the opposite, again from the suggestion thread.

* [^(?=.)]
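
Both patterns rely on a lookahead right after the start of the string. A minimal Python sketch of how they behave, with helper names that are mine and not part of Pipes:

```python
import re

EMPTY = re.compile(r"^(?!.)")      # start of string NOT followed by a character
NOT_EMPTY = re.compile(r"^(?=.)")  # start of string followed by a character

def is_empty(value: str) -> bool:
    """True if the 'matching empty' pattern matches the value."""
    return EMPTY.match(value) is not None

def is_not_empty(value: str) -> bool:
    """True if the 'matching not empty' pattern matches the value."""
    return NOT_EMPTY.match(value) is not None
```

One caveat: without the s flag the dot does not match a new line, so a value that starts with a line feed counts as “empty” here.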

**Removing whitespace**

Sometimes, you’d like to remove all the linefeeds and unwanted spaces out of a field. I usually use a three- to fourfold approach to that. For each of the following replacements, use +g (the global flag).

# replace [\n] (line feed) with [ ]
# replace [\r] (carriage return) with [ ]
# (as needed) replace all [<br />] (html break) with [ ]
# replace [\s+] (all whitespace occurrences) with [ ]

With 1 and 2, you remove all hard linefeeds. With 3, you remove all “logical” linefeeds (the ones that only get rendered when the field is interpreted as html). With 4, you make the result more compact. If, for example, you have 3 or more spaces in a row, those will be reduced to just one space.
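
The four steps can be sketched in Python like this. The function name and the sample input are made up, and the break pattern `<br\s*/?>` is my assumption for what an “html break” looks like.

```python
import re

def squeeze_whitespace(field: str) -> str:
    """Apply the four replacements in order; re.sub is global by default."""
    field = re.sub(r"\n", " ", field)         # 1: line feeds
    field = re.sub(r"\r", " ", field)         # 2: carriage returns
    field = re.sub(r"<br\s*/?>", " ", field)  # 3: html breaks (assumed pattern)
    field = re.sub(r"\s+", " ", field)        # 4: collapse whitespace runs
    return field

cleaned = squeeze_whitespace("one\r\ntwo<br />three    four")
```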

**Using reserved characters**

In RegEx, some characters are “reserved”. That means they are not matched literally, but have a special function instead. Examples:

* [.] – one arbitrary character. If the +s flag is set, this includes the new-line character (\n). If the +s flag is unset, the dot does not match the new-line character.
* [\d] – one digit (0..9).
* [\n] – new line, like in C
* [\r] – carriage return, like in C
* [\s] – one whitespace character. Includes ‘ ‘, tabs (\t) and line breaks (\n, \r).

* [^] – beginning of string. If the +m flag is set, this matches at the start of every line. A line then starts at the very start of the string or right after a new line (‘\n’). If the +m flag is unset, this matches only at the very start of the string.

* [$] – end of string. If the +m flag is set, this matches at the end of every line. If the +m flag is unset, this matches only at the very end of the string.

* [()] – groups. You can use the matched groups in the replacement field. For example, replacing [(\d)] with [0$1] adds a leading zero.
* [[]] – character classes. For example, [123] (square brackets included this time) matches 1, 2 or 3.

* [\D] – negation. There is no ‘!’ operator in RegEx; the uppercase \D means “not a digit”, so one character that is anything but a digit is matched here. Inside a character class, a leading ^ negates, so [^\d] (square brackets included) is equivalent.
* [\d*] – ‘*’ means 0 to n matches. This matches any number of digits, including none.
* [\d+] – ‘+’ means 1 to n matches. This matches one or more digits.

To “escape” reserved characters, that is to match them literally, you put a backslash in front. For example, matching [(twitter)] literally is possible by using [\(twitter\)].
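
A few of these can be tried quickly in Python. This is only a sketch with made-up sample strings. Note that Python writes a back-reference in the replacement as `\1` where Pipes uses `$1`.

```python
import re

# Groups: prepend a zero to a single digit (Pipes: replace (\d) with 0$1)
padded = re.sub(r"(\d)", r"0\1", "7")

# Character class: [123] matches one of 1, 2 or 3
in_class = re.findall(r"[123]", "0123456")

# '*' allows zero digits, '+' needs at least one
star = re.fullmatch(r"ab\d*", "ab") is not None
plus = re.fullmatch(r"ab\d+", "ab") is not None

# Escaping: the backslashes make the parentheses literal
literal = re.search(r"\(twitter\)", "follow me (twitter)") is not None

# re.escape adds the backslashes for you
escaped = re.escape("(twitter)")
```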

**Removing html tags**

From a post in the [ Yahoo Pipes Discussion].

* [<[^>]*>] – the inner square brackets are part of the pattern here, so the expression is <[^>]*> . It matches every tag, that is everything within <>.
* [<.*?>] – similar to the first statement, but a “lazy match”. Not as efficient.
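
Here is a small Python sketch comparing the two patterns on made-up input. Besides efficiency, they differ when a tag spans several lines: the negated class crosses the line break, while the lazy dot does not unless the s flag is set.

```python
import re

html = '<p>Hello <a href="x">world</a></p>'

negated_class = re.sub(r"<[^>]*>", "", html)
lazy = re.sub(r"<.*?>", "", html)

# A tag that contains a line break
multiline = "<a\nhref='x'>link</a>"
negated_multiline = re.sub(r"<[^>]*>", "", multiline)
lazy_multiline = re.sub(r"<.*?>", "", multiline)  # leaves the opening tag behind
```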

**Showing Images**

From a post in the [ Yahoo Pipes Discussion]. Sometimes, one of your fields contains just an image URL. You’d like to replace that URL with an image tag, so it is rendered as an image.

* Replace [(.*)] with [<img src="$1" alt="" />]
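
In Python, the same replacement could look like the sketch below. The URL is made up. I pass `count=1` to mimic an unticked g flag; with a global replace, Python would also replace the empty match at the end of the string and produce a second, empty image tag.

```python
import re

url = "http://example.com/cat.png"  # made-up URL

# \1 is Python's spelling of $1; count=1 replaces only the first match
img_tag = re.sub(r"(.*)", r'<img src="\1" alt="" />', url, count=1)
```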

**Prefixing something**

Sometimes, you’d like to add something in front of a field. For example, to add a “Yahoo: ” in front of every title, you could

* Replace [(^)] with [Yahoo: $1]

$1 refers to the first group used (we have only one group in this example). And ^ matches the beginning of the string.

Source: [ RegEx documentation started]

**Postfixing something**

And to suffix something, you’d use a $ instead of the ^.

* Replace [($)] with [Yahoo: $1]

Source: [ RegEx documentation started]
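
Both tricks work because ^ and $ match an empty position, so the “replacement” is really an insertion. A Python sketch, with a made-up title and a suffix text of my own choosing:

```python
import re

title = "Pipes are fun"

# ^ matches the empty position at the start, so the text is inserted in front
prefixed = re.sub(r"(^)", r"Yahoo: \1", title)

# $ matches the empty position at the end, so the text is appended
suffixed = re.sub(r"($)", r" (Yahoo)\1", title)
```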

**Translating dates**

What if you want to change a date of format mm/dd/yy to the ISO equivalent of yyyy-mm-dd? You could use an expression like this one:

* Replace [(\d\d)\/(\d\d)\/(\d\d)] with [20$3-$1-$2]

Here, we have three groups. In the result, I also prefix a “20” as the year was specified only with two digits.
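
The same reordering of groups, sketched in Python. The function name is mine, `\g<3>` is Python's spelling of `$3`, and the hard-coded “20” assumes all years fall between 2000 and 2099.

```python
import re

def to_iso(date: str) -> str:
    """Rewrite mm/dd/yy as yyyy-mm-dd, assuming a year in 2000-2099."""
    return re.sub(r"(\d\d)/(\d\d)/(\d\d)", r"20\g<3>-\g<1>-\g<2>", date)

iso = to_iso("01/11/09")
```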

**Convert to Uppercase**

Also from a [ Yahoo Pipes Discussion]. You can use the \U flag in the replacement to convert something to uppercase. For example:

* replace [(.*)] with [\U$1]

**Convert to Lowercase**

No surprise here: you can use the \L flag in the replacement to convert something to lowercase.

* replace [(.*)] with [\L$1]
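
Not every RegEx engine knows \U and \L. Python's `re` module, for example, does not support them in the replacement string, so a sketch there has to swap in a replacement function instead. The helper names are mine.

```python
import re

def upper_all(field: str) -> str:
    """Uppercase every match, standing in for \\U$1."""
    return re.sub(r".+", lambda m: m.group(0).upper(), field)

def lower_all(field: str) -> str:
    """Lowercase every match, standing in for \\L$1."""
    return re.sub(r".+", lambda m: m.group(0).lower(), field)
```

For a whole field, plain `field.upper()` would do the same; the function form becomes useful once the pattern matches only part of the field.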

** From a text file online

I’m In The Twitter Timeline Again, But Now President Bush’s Speechwriter Isn’t #fail

Posted in Politics, Social, Technology by wolfsbayne on January 11, 2009

…as of 3 weeks ago.

I was a bad blogger and didn’t follow up after I posted to Twitter regarding being back in the timeline.

The short: I didn’t hear anything from the Twitter folks until I posted on, a public customer service website where twitter has a presence. Within a couple days I was miraculously back in the timeline. Kudos to services like GS. Here’s the link to the thread at GS:

When I was finally contacted by Twitter, I received an email from them. They didn’t reply to the thread at GS. Twitter claimed I was reported as a “spammer” and it took them over a month to “research” whether or not I was a spammer. So, either a.) it’s all B.S. and people in the Twitter San Francisco HQ are sympathetic to liberal causes, will censor you, and then will drag their heels to resolve an issue like this until you make your grievance public or b.) Twitter’s system is easily gamed, whereby if enough people block you, you get blacklisted and then Twitter’s support will…drag their heels to resolve an issue like this. I mean, we’re talking about people living and working in San Francisco. It’s not really a bastion of conservative thinking in NorCal is it? Sure, have a good laugh at the conservatives being censored, right?

What prompted me to post a followup today was the fact that I just received an @ reply on Twitter from Michael Johns (@michaeljohns), a former speechwriter for George H.W. Bush, in which Johns states that he’s missing from the timeline. This happened shortly after Michael called out Barack Obama publicly on Twitter for lying. Here’s the link to all of Michael’s tweets in Twitter’s search index: You see, there are only a few tweets in the public timeline out of his 65 (the number at the time of this writing) tweets.

I’m going to recommend that Michael piggyback on my thread at GS and post a description of his problem publicly. I’d encourage any other conservatives to do the same if they find themselves in the same situation of being censored, either by “bug”, happenstance, or intent.

“He who controls the news controls the views.”

Twitter, you need to get this corrected ASAP. It’s really starting to look suspect – even if it isn’t. #fail