girtby.net » twitter
http://girtby.net
this blog is girtby.net
Thu, 17 Sep 2009 14:27:44 +0000

Archiving Tweets
http://girtby.net/archives/2009/08/23/archiving-tweets/
Sun, 23 Aug 2009 12:14:05 +0000, alastair

If you’ve run Damon Cortesi’s handy curl command to download all (or the last 3200) tweets from your Twitter account, you’ll have a directory full of files with names like user_timeline.xml?count=100&page=1. Not only that, but they include a large amount of redundant profile stuff in the <user> element. And not only that, but Twitter sometimes returns a “Twitter is over capacity” page instead of your tweets.

What we want to do is a) detect any files which don’t contain tweets, b) remove the redundant user profile, and c) combine the results into a single file.

Well, friends, here is a shell script to do exactly that. You’ll need zsh and xsltproc, both of which are standard on Mac OS X and most sane Linuxen.

zsh is needed to sort the input files in numeric, as opposed to lexicographic, order. If you know of a way to do this in bash, let me know…
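
On the bash question, one possible approach is to prefix each file name with its page number, sort on that, and strip it off again. This is only a sketch: the function name numeric_page_order is made up for illustration, and it assumes every file name ends in page=<number> and contains no newlines or tabs.

```shell
# Hypothetical helper (not part of the original script): print the given
# file names, one per line, ordered by the number after "page=".
numeric_page_order() {
    for f in "$@"; do
        # Emit "<page number><TAB><file name>" so sort can key on the number
        printf '%s\t%s\n' "${f##*page=}" "$f"
    done | sort -n | cut -f2-
}

# Possible usage: numeric_page_order user_timeline.xml\?count=100\&page=*
```

This relies only on POSIX parameter expansion, sort and cut, so it should behave the same in bash, zsh and plain sh.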

Output is on stdout, so just redirect to your filename of choice:

$ tweetcombine user_timeline.xml\?count=100\&page=* \
    > tweet_archive.xml

Here’s the script:

#!/bin/zsh

# Combine all of the twitter user_timeline.xml files specified on the command line into a single output
# Written by Alastair Rankine, http://girtby.net
# Licensed as Creative Commons BY-SA

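# Collect the input files in numeric order: (o) sorts ascending, (n) numerically
input_args=()
for f in ${(on)*}; do
    # exit only takes a numeric status, so report the error on stderr first
    [[ -f $f ]] || { print -u2 "Not a file: $f"; exit 1; }
    # Escape ampersands so the file name is valid XML text
    input_args+="<input>${f//&/&amp;}</input>"
done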

xsltproc - <<EOF
<?xml version="1.0"?>
<!DOCTYPE inputs [
  <!ATTLIST xsl:stylesheet id ID #REQUIRED>
]>
<?xml-stylesheet type="text/xml" href="#style1"?>
<inputs>
  ${input_args}

  <xsl:stylesheet id="style1" version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="*">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates/>
      </xsl:copy>
    </xsl:template>

    <xsl:template match="statuses">
      <xsl:apply-templates/>
    </xsl:template>

    <xsl:template match="user"/>

    <xsl:template match="xsl:stylesheet"/>

    <xsl:template match="input">
      <xsl:choose>
        <xsl:when test="document(.)/statuses">
          <xsl:apply-templates select="document(.)"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:message terminate="yes"><xsl:value-of select="."/> does not contain statuses element</xsl:message>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

    <xsl:template match="inputs">
      <statuses type="array">
        <xsl:apply-templates/>
      </statuses>
    </xsl:template>

  </xsl:stylesheet>
</inputs>
EOF

I think this method of sticking filename arguments into an XML document with an embedded stylesheet is quite a powerful way of processing XML documents with shell scripts. (I should probably put the <input> tags into a separate namespace though…)

Twitter over IP
http://girtby.net/archives/2008/06/04/twitter-over-ip/
Wed, 04 Jun 2008 10:43:00 +0000, alastair

Let’s solve Twitter’s scalability problems, shall we?

So, like most people, I don’t know much about the problems there and certainly don’t have any solutions to suggest. But I do know there is a certain class of solutions which isn’t on the table.

If you look at Twitter from a suitably high vantage point, you see real-time communication between small groups: people enter short messages, and those messages appear for their peers a short time later. There’s also a central archive, but I’ve heard Twitter described as “public Instant-Messaging” and this seems to characterise it best for me.

In short, Twitter seems more suited to peer-to-peer communication than to client-server. What sort of protocol would it use? I can imagine a protocol which would probably be UDP-based, and which would send tweets to followers either directly from peers or perhaps through a local aggregation point. Large groups of followers could perhaps even use UDP multicast. Archive servers could be reached through network anycast addresses, to allow for greater decentralisation. IPv6 to get universal connectivity. And so on; fill in your own pet network technology here, there are certainly lots of potential solutions.

Instead of these, clients communicate directly with the Twitter servers using HTTP. Not only that, but they poll for updates. Bit of an architectural blunder, you might think. Well not really. In fact I don’t think the Twitter designers had any choice.

Once upon a time it was possible to deploy new application-layer protocols on the Internet. But those times have passed, it seems. These days, it’s HTTP(S) or nothing. And this is not the protocol you would choose for carrying tweets, if you had the choice. So the fact that Twitter works at all over this sub-optimal application-layer protocol is quite an achievement.

This is a great example of the many ways in which innovation can be stifled by enforcing a lowest common denominator.

The impact is of course more widespread than just Twitter. In fact, the so-called end-to-end principle, which was one of the founding principles of the Internet, is now all but abandoned in practice. Geoff Huston examines the issue in some detail in a recent article, which is highly recommended.

Of course, there are no easy answers, either for Twitter or the next application to suffer due to the proliferation of network middleware. But it’s certainly an issue that does need to be more prominent.

(This post is an obvious departure from my usual style of blatant attack pieces in order to score traffic and fame for myself. Normal service will resume shortly.)
