Data Pipeline vs API: what’s the difference?

A question we get often about our new Parse.ly Data Pipeline is, “What’s the difference between raw data access provided by your Data Pipeline and the access provided by your HTTP/JSON API?”

This is a good question! This post will seek to answer it.

Parse.ly’s HTTP API, which has been live for several years, is implemented using RESTful design principles and simple JSON data formats. Programmers using our HTTP API can access aggregate Parse.ly data very rapidly: HTTP clients and JSON parsers exist in every programming language, and HTTP/JSON requests can even be made inside web browsers and mobile apps themselves. For example, to power the “popular posts” module on this blog, we make a single call to a URL in Parse.ly’s API, which returns JSON describing the most popular posts on our blog over the last 7 days. At the time of this writing, here is what that call returns (with some details elided):

{
  "success": true,
  "data": [{
    "_hits": 921,
    "title": "PyKafka: Fast, Pythonic Kafka, at Last!",
    "url": "https://parsely.com/post/3886/pykafka-now/"
  }, {
    "_hits": 497,
    "title": "Lucene: The Good Parts",
    "url": "https://parsely.com/post/1691/lucene/"
  }, {
    "_hits": 320,
    "title": "Measuring the Impact of The John Oliver Effect",
    "url": "https://parsely.com/post/2380/measuring-the-impact-of-the-john-oliver-effect/"
  }, ...]
}

A single HTTP request, which could be implemented in a single line of code, summarizes traffic on the entire Parse.ly blog: traffic that Parse.ly tracked as thousands of individual raw events. It also fuses this information, automatically, with properties of the articles themselves, such as title, author, and tags. We also have an extensive recommendation API (see our in-depth posts on this: part 1 & part 2) that simplifies the implementation of on-site recommendation engines using this same rich information about your content.
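In Python, for instance, that single request might look like the sketch below. The endpoint path and query parameters are illustrative assumptions, not a substitute for our API documentation, and your own apikey and secret will differ:

import requests

# Fetch the most popular posts over the last 7 days via Parse.ly's HTTP API.
# NOTE: the parameter names here are illustrative; consult the API docs for
# the exact options your account supports.
resp = requests.get(
    "https://api.parsely.com/v2/analytics/posts",
    params={
        "apikey": "parsely.com",   # your site's API key
        "secret": "YOUR_SECRET",   # your API shared secret
        "days": 7,                 # look-back window
        "limit": 3,                # top three posts
    },
)
resp.raise_for_status()

for post in resp.json()["data"]:
    print(post["_hits"], post["title"], post["url"])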

An example trending content module generated using Parse.ly’s fast and easy API.

The key thing to understand is that Parse.ly’s HTTP/JSON API summarizes information about traffic and content so that you can quickly build on-site content link modules, generate traffic snapshots for quick data exports, or wire simple integrations of Parse.ly traffic data into existing content management systems. The API is not a full-fidelity accounting of all your analytics data; it is instead an integration tool for rapidly infusing existing sites, apps, and reports with Parse.ly data, including our real-time traffic data.

Parse.ly’s Data Pipeline is more flexible

The Parse.ly Data Pipeline is something very, very different. It’s not just an API. It’s the ultimate API. It’s a rich way to unlock 100% of the data behind Parse.ly’s analytics, and analyze it for your own organization’s needs.

It does not use HTTP (though it does still use JSON), and it is slightly more involved to integrate, but it opens up a wealth of opportunities, including in-house analytics, data science, advertising/subscriber integrations, and more.

An example raw data record looks something like this (also simplified):

{
  "action": "pageview",
  "apikey": "parsely.com",
  "referrer": "https://t.co/tsvV5QJ7YM",
  "ts_action": "2016-04-30 17:45:03",
  "url": "...parsely.com/post/3886/pykafka-now/",
  "ua": "... Chrome/49.0.2623.105 Mobile Safari/537.36",
  "visitor_site_id": "dec4bd83-286c-4f5c-9b67-5f8070759e3d"
}

This record represents a single pageview event. Thousands of records like this, along with thousands more capturing engaged time, make up the “complete traffic record” for a single day on Parse.ly’s blog. Our full schema is documented in our raw data documentation.
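Because each event is a flat JSON object, ad-hoc analysis requires nothing beyond a JSON parser. As a toy example, the following Python sketch tallies pageviews per URL from a hypothetical file of newline-delimited raw events:

import json
from collections import Counter

pageviews = Counter()

# "events.json" is a hypothetical file holding one raw event per line,
# each shaped like the sample record above.
with open("events.json") as f:
    for line in f:
        event = json.loads(line)
        if event["action"] == "pageview":
            pageviews[event["url"]] += 1

for url, hits in pageviews.most_common(10):
    print(hits, url)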

Hopefully you can see why querying this data through HTTP API endpoints like the one above would be cumbersome indeed. Pulling 7 days of raw data that way would mean tens of thousands of raw events, many megabytes of data. And our Parse.ly blog doesn’t get that much traffic: for a high-traffic website like Mashable.com, a single day ends up being gigabytes, or even tens or hundreds of gigabytes, of raw data. This becomes completely impractical to query “on-demand”.

Instead, the data needs to be delivered, not queried. After Parse.ly’s distributed data collection infrastructure ensures every raw event is captured, our Data Pipeline infrastructure ensures those raw events are validated, enriched, and streamed directly to you. You finally have a way to access your analytics data in bulk, and to consume real-time events as they arrive.
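To make “streamed directly to you” concrete, here is a minimal Python consumer sketch, assuming delivery into an Amazon Kinesis stream; the stream name and the single-shard setup are placeholder assumptions for illustration:

import json
import time

import boto3

kinesis = boto3.client("kinesis")

# "parsely-events" and the shard ID are placeholders; a real consumer would
# discover shards via describe_stream and use your provisioned stream name.
shard_iterator = kinesis.get_shard_iterator(
    StreamName="parsely-events",
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    out = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
    for record in out["Records"]:
        event = json.loads(record["Data"])
        print(event["action"], event.get("url"))
    shard_iterator = out["NextShardIterator"]
    time.sleep(1)  # poll politely rather than hammering the stream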

As discussed in my last post, “Raw Data: The Ultimate API”, having access to the raw data opens up many use cases, including: cloud SQL via Redshift/BigQuery; open source data analysis stacks like Python/R/Spark; and direct integrations with product, ad/subscriber systems, and more.

The Data Pipeline is typically used for building in-house analytics. Showcased here is a custom Looker dashboard built atop an Amazon Redshift database storing Parse.ly raw events.
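A dashboard like that ultimately boils down to ordinary SQL over the raw events. Here is a minimal sketch in Python, assuming the events have been loaded into a Redshift table named rawdata with the fields shown earlier; the connection details and table name are placeholders:

import psycopg2

# Connection details are placeholders; substitute your own Redshift
# cluster endpoint, database, and credentials.
conn = psycopg2.connect(
    host="example-cluster.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="analyst",
    password="YOUR_PASSWORD",
)

with conn.cursor() as cur:
    # Daily pageview counts per URL, using the action/ts_action/url
    # fields from the raw event schema above.
    cur.execute("""
        SELECT TRUNC(ts_action) AS day, url, COUNT(*) AS pageviews
        FROM rawdata
        WHERE action = 'pageview'
        GROUP BY 1, 2
        ORDER BY pageviews DESC
        LIMIT 10
    """)
    for day, url, hits in cur.fetchall():
        print(day, hits, url)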

Most of these use cases would be limited or impossible with Parse.ly’s HTTP API, but they are made easy by our Data Pipeline.

Get in Touch

Whether you need help with Parse.ly’s API or Data Pipeline, all you need to do is get in touch with our team.

  • If you are already a Parse.ly customer, reach out and we’ll be happy to advise you on advanced use cases for your raw data.

  • If you are not yet a Parse.ly customer, we’d be glad to schedule a demo where we can share some of the awesome things our existing customers have done with our API and Data Pipeline.