Raw Data is the Ultimate Analytics API
Data layers, data lakes, data warehouses. It all feels a little overcomplicated, doesn’t it?
At Parse.ly, we like to ask ourselves, “how can we make data and analytics easy — rather than a chore?”
That’s what we’ve been doing for years with our awesome real-time content dashboards, which are now trusted by the staff of over 700 of the web’s highest-traffic sites, including TechCrunch and The Intercept. That’s what we’ve been doing with our powerful HTTP/JSON APIs for developers, which now get over 2.5 billion API calls per month, powering site experiences for the likes of Ars Technica and The New Yorker.
Parse.ly’s real-time and historical dashboards are trusted by over 170 companies to deliver live insights about over 475 million monthly unique visitors.
Similar to the world of analytics dashboards and analytics APIs a few years ago, the world of analytics warehousing and business intelligence today looks scary and unapproachable. But it doesn’t need to be that way.
Rather than backing down from the challenge, we decided to tackle it head-on. How could we design a modern and robust platform for unlocking the value of user and content interactions? It would need to be a platform built for the modern developer — centered around simplicity; compatibility with the public cloud; friendliness with open source tools; and, most of all, rock-solid reliability.
In short, how could we deliver you the raw data about your sites, apps, and users, with minimal fuss and maximum flexibility? After all, there’s an explosion of tools available for analyzing raw data and drawing insights — if only you could easily get your hands on that data. Could we provide the ideal data source for these tools?
Enter Parse.ly Data Pipeline. Building on the analytics infrastructure expertise we’ve developed by processing over 50 billion user events per month, Parse.ly is now making its fully-managed pipeline available as a service for developers. Finally, developers have real-time and historical access to raw data: the ultimate API.
Raw Data is the good stuff
It’s pure. It’s loaded with important information about your website and app visitors. It’s highly customizable. It’s not limited by one-size-fits-all dashboards or APIs. It’s a building block, a foundation.
Any good data scientist, analyst, or engineer knows that there is a ton of value in raw user interaction data. They can use that data to improve your business and delight your users. But for far too long, it’s been way too difficult to collect, enrich, transform, and store.
No longer: today, Parse.ly’s Data Pipeline makes Raw Data available to anyone, at the flip of a switch.
Parse.ly’s stream processor not only validates and cleans the data — it also enriches it. Our base schema has over 50 standard enriched attributes per event, all automatically inferred from lightweight tracking events. This includes geographic region, device type, and traffic source.
But here’s the best part: you own it all
Rather than trying to sell you a hosted data warehouse, or resell you Redshift or BigQuery, Parse.ly just delivers you the enriched data, and lets you do whatever you want with it. How cool is that?
- Durable storage and historical access are handled by your very own Amazon S3 Bucket. Data syncs into your S3 Bucket every 15 minutes, and is stored in a clean, gzip-compressed JSON format.
- Streaming events and real-time access are handled by your very own Amazon Kinesis Stream. Data flows into your Kinesis Stream in real-time, with end-to-end latency measured in seconds. Events are written in a matching JSON format.
Parse.ly provisions all this infrastructure automatically. You don’t even need to be an AWS user to make use of it — some of our pilot customers are using these managed endpoints from Google Cloud Platform or their own data centers!
Parse.ly ensures data is automatically and continuously written to the right place. So, within minutes, you can load the data up in the tools you love.
Leverage our official code samples (currently for CLI or Python) to get started. Or, clone our open source Python project on GitHub to get the data into convenient analytics engines like Redshift or BigQuery, or even straight into your local development/analysis environment!
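To give a feel for how little code is involved, here is a minimal sketch of reading one gzip-compressed file of events using only the Python standard library. This is not one of the official samples. It assumes each file holds one JSON object per line, which the description above does not spell out, and it builds its sample in memory rather than fetching from S3. The function name iter_events and the sample URLs are made up for illustration.

```python
import gzip
import io
import json
from typing import Iterator


def iter_events(gzipped: bytes) -> Iterator[dict]:
    """Yield one event dict per non-empty line of a gzip-compressed payload.

    Assumption: the payload is newline-delimited JSON (one event per line).
    """
    with gzip.open(io.BytesIO(gzipped), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny in-memory sample instead of downloading a real file from S3.
sample_lines = [
    json.dumps({"action": "pageview", "url": "https://example.com/a"}),
    json.dumps({"action": "pageview", "url": "https://example.com/b"}),
]
sample_payload = gzip.compress("\n".join(sample_lines).encode("utf-8"))

events = list(iter_events(sample_payload))
print(len(events), events[0]["url"])
```

In practice the bytes would come from whichever S3 client you prefer; the parsing step stays the same.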
Raw Data unlocks queries and segments
Alright, so that developer experience sounds pretty sweet, but I bet you’re wondering — what can I do with this data? Glad you asked!
Parse.ly’s dashboard supports a slew of metrics out-of-the-box, such as views, visitors, time, and shares. However, our event analytics pipeline is generic, and can support custom events of all kinds. Interested in measuring scroll depth, on-site share actions, content recommendation clicks, viewable ad impressions, or newsletter subscriptions? Each of these can be modeled as a raw event, sent to Parse.ly, and delivered via the Data Pipeline.
What’s more, each event includes a wealth of information that Parse.ly does not make much use of inside its dashboards. This includes time-of-day information, detailed visitor geography (at city/postal code level), query parameters (for paid campaign tracking), user session information, and more. By having access to this data, your team can build custom analyses that make sense to your own business and to your users or visitors.
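As one example of such a custom analysis, the sketch below counts pageviews per paid campaign by reading a query parameter out of each event’s url. It rests on a few assumptions: that the campaign parameter is carried in the url field (the raw data may expose query parameters differently), and that your campaigns use the common utm_campaign convention. The helper campaign_of and the sample events are hypothetical.

```python
from collections import Counter
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def campaign_of(url: str, param: str = "utm_campaign") -> Optional[str]:
    """Return the first value of `param` in the URL's query string, if any."""
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else None


# Hypothetical events; only `action` and `url` are used here.
sample_events = [
    {"action": "pageview", "url": "https://example.com/post?utm_campaign=spring"},
    {"action": "pageview", "url": "https://example.com/post?utm_campaign=spring&x=1"},
    {"action": "pageview", "url": "https://example.com/other"},
]

# Count pageviews per campaign, bucketing untagged traffic as "(none)".
views_by_campaign = Counter(
    campaign_of(e["url"]) or "(none)"
    for e in sample_events
    if e["action"] == "pageview"
)
print(views_by_campaign)
```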
Raw Data allows for arbitrary SQL
Sometimes you have analyst teams who know how to write SQL and would love to ask questions against a subset of user data. SQL access is one of the primary drivers for our current Data Pipeline customers — even though we don’t directly provide it.
In the past, this was very hard to do since most SQL engines could not handle large-scale analytics data well, but this has changed with the public availability of Amazon Redshift, Google BigQuery, and other tools. Parse.ly’s Data Pipeline is specifically optimized to make it trivial to bulk load Parse.ly’s data into these tools — for example, our JSON formats are tested to work with automatic JSON conversion/import tools. Thus, with a couple of commands, you can not only fully own your analytics data — you can also ask any question of it!
For example, here’s a sample query one could write against an Amazon Redshift database that has been set up with Parse.ly’s raw data, without customizing our base event schema at all:
SELECT meta_title AS title, url, COUNT(action) AS views
FROM parsely.rawdata
WHERE action = 'pageview'
  AND ts_action > current_date - interval '7' day
GROUP BY meta_title, url
ORDER BY views DESC
This would give back a table that might look like this:
This shows the top posts (on this blog!) over the last 7 days. The benefit of a query like this is that you can customize it however you like by editing the SQL. For example, our dashboards and APIs have no notion of visitor geography, but our raw data includes an ip_country field, describing the visitor’s country of origin. Filtering out all non-US traffic would be trivial; since the query already has a WHERE clause, you’d just add:
AND ip_country = 'US'
to the SQL query. Done! That’s the power of SQL.
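If you would rather explore the events in code before loading them into a SQL engine, the same aggregation can be sketched in plain Python. This is a rough equivalent, not a replacement for the query: it uses the action, url, meta_title, and ip_country fields mentioned above, but it skips the 7-day date filter because the format of ts_action is not described here. The function top_pages and the sample events are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def top_pages(
    events: Iterable[dict], country: Optional[str] = None
) -> List[Tuple[Tuple[str, str], int]]:
    """Count pageviews per (title, url), most viewed first.

    If `country` is given, only events whose ip_country matches are counted.
    """
    views: Counter = Counter()
    for event in events:
        if event.get("action") != "pageview":
            continue
        if country is not None and event.get("ip_country") != country:
            continue
        views[(event.get("meta_title"), event.get("url"))] += 1
    return views.most_common()


# Hypothetical events for illustration only.
sample_events = [
    {"action": "pageview", "meta_title": "Post A", "url": "/a", "ip_country": "US"},
    {"action": "pageview", "meta_title": "Post A", "url": "/a", "ip_country": "CA"},
    {"action": "pageview", "meta_title": "Post A", "url": "/a", "ip_country": "US"},
    {"action": "pageview", "meta_title": "Post B", "url": "/b", "ip_country": "US"},
    {"action": "share", "meta_title": "Post B", "url": "/b", "ip_country": "US"},
]

all_views = top_pages(sample_events)
us_views = top_pages(sample_events, country="US")
print(all_views)
print(us_views)
```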
Raw Data enables Business Intelligence
Via their integration with the cloud SQL engines mentioned above, a number of Business Intelligence (BI) products are on the market that can assist you with building live-refreshing dashboards powered by SQL queries.
To kick things off, Parse.ly partnered with Looker, a SaaS BI platform that enables teams to run arbitrary queries and share live-refreshing dashboards with one another. Parse.ly and Looker share the philosophy of “analytics for everyone”, and we think it’s a great choice for working with Parse.ly raw data.
Pictured below is a Looker dashboard that a customer built atop our Data Pipeline, showing top-line view, visitor, and session counts, as well as pie charts breaking out devices, operating systems, and countries of origin. Existing LookML Data Apps and Blocks make this a snap.
An example Looker dashboard built from Parse.ly’s Data Pipeline. The customer receives the streaming data (via Kinesis/S3) and loads it into Amazon Redshift, while maintaining full control over the ETL process and SQL schema. Looker queries the data from Redshift, using some standard column names and types that are common to Parse.ly raw events.
Raw Data accelerates Data Science
Data Science can be thought of as an intersection between programming and data analysis. One of the first problems a new data scientist encounters when working on any website is the lack of good, clean raw data about the company’s web and mobile interactions.
With Parse.ly’s Data Pipeline, this raw data is available to these teams immediately — no infrastructure build-out required. The formats and access patterns were specifically optimized to integrate with a wide range of open source data science tools, such as Spark, Python/Pandas, and R. Every single open source project that cares about data has an S3 connector and every single programming language has a fast JSON parser. So, with minimal work on your part, you’ll be able to load your data in and play around.
The open source data science ecosystem around Spark (pictured above) has exploded in the last couple of years, but before tools like Parse.ly Data Pipeline, it was hard to get access to clean clickstream data on your users, and a robust data collection infrastructure for new events.
In Short: Our Data Pipeline puts you in control
Parse.ly’s Data Pipeline means you get to take back your data. From the point solution vendors holding it hostage. From the crappy APIs rate limiting your calls. From your bosses who told you they don’t have time for a 6-month data pipeline infrastructure build-out.
This new product leverages all the work we’ve put into building an awesome existing data collection infrastructure, which has already been scaled to collect billions of events from billions of devices. And this includes raw, unsampled data from web browsers, mobile visitors, and elsewhere. That rock-solid foundation can now be yours.
The use cases for Parse.ly’s Data Pipeline are only limited by your imagination. Some of the projects customers have taken on with Parse.ly’s new offering include exploratory data science; product build-outs; new business intelligence stacks; customer journey analyses; visitor loyalty models; and executive dashboards.
In-house analytics doesn’t need to be a chore. We’re looking forward to hearing about all the awesome things you’ll build once you’ve got your own Data Pipeline for raw streaming analytics from your sites and apps.
Get in Touch
If you are already a Parse.ly customer, get in touch with us, and we’ll be happy to consult with you on advanced use cases for your raw data.
If you are not a Parse.ly customer, we are glad to schedule a demo where we can share some of the awesome things our existing customers have done with this unlimited flexibility.