Despite the attention big data has received in the media and among the technology community, it is surprising that we are still shortchanging the full capabilities of what data can do for us. At times, we get caught up in the excitement of the technical challenge of processing big data and lose sight of the ultimate goal: to derive meaningful insights that can help us make informed decisions and take action to improve our businesses and our lives.
I recently spoke on the topic of automating content at the O’Reilly Strata Conference. It was interesting to see the various ways companies are attempting to make sense out of big data. Currently, the lion’s share of the attention is focused on ways to analyze and crunch data, but very little has been done to help communicate results of big data analysis. Data can be a very valuable asset if properly exploited. As I’ll describe, there are many interesting applications one can create with big data that can describe insights or even become monetizable products.
To date, the de facto format for representing big data has been visualizations. While visualizations are great for compacting a large amount of data into something that can be interpreted and understood, the problem is just that: visualizations still require interpretation. There were many sessions at Strata about how to create effective visualizations, but the reality is the quality of visualizations in the real world varies dramatically. Even the visualizations that do make intuitive sense often require some expertise and knowledge of the underlying data. That means a large number of people who would be interested in the analysis won’t be able to gain anything useful from it because they don’t know how to interpret the information.
To be clear, I’m a big fan of visualizations, but they are not the end-all in data analysis. They should be considered just one tool in the big data toolbox. I think of data as the seeds for content, whereby data can ultimately be represented in a number of different formats depending on your requirements and target audiences. In essence, data are the seeds that can sprout as large a content tree as your imagination will allow.
Below, I describe each limb of the content tree. The examples I cite are sports related because that’s what we’ve primarily focused on at my company, Automated Insights. But we’ve done very similar things in other content areas rich in big data, such as finance, real estate, traffic and several others. In each case, once we completed our analysis and targeted the type of content we wanted to create, we completely automated the future creation of the content.
These are bullets, headlines, and tweets of insights that can boil a huge dataset into very actionable bits of language. For example, here is a game notes article that was created automatically out of an NCAA basketball box score and historical stats.
Mobile and social content
We’ve done a lot of work creating content for mobile applications and various social networks. Last year, we auto-generated more than a half-million tweets. For example, here is the automated Twitter stream we maintain that covers UNC Basketball.
By metrics, I’m referring to the process of creating a single number that’s representative of a larger dataset. Metrics are shortcuts to boil data into something easier to understand. For instance, we’ve created metrics for various sports, such as a quarterback ranking system that’s based on player performance.
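To make the idea of a metric concrete, here is a minimal sketch in Python. It does not show our own quarterback ranking system; as a stand-in, it uses the publicly documented NFL passer rating formula, which collapses five passing stats into one number. The function names `clamp` and `passer_rating` are my own choices for illustration.

```python
def clamp(value, low=0.0, high=2.375):
    """Keep each component of the rating inside the range the formula allows."""
    return max(low, min(high, value))


def passer_rating(completions, attempts, yards, touchdowns, interceptions):
    """Collapse a quarterback's passing line into a single number.

    This is the standard NFL passer rating, used here only as an example of
    a metric. It is an assumption for illustration, not the ranking system
    described in the text.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")

    a = clamp((completions / attempts - 0.3) * 5)        # completion percentage
    b = clamp((yards / attempts - 3) * 0.25)             # yards per attempt
    c = clamp((touchdowns / attempts) * 20)              # touchdown rate
    d = clamp(2.375 - (interceptions / attempts) * 25)   # interception rate

    return (a + b + c + d) / 6 * 100


# A hypothetical stat line: 20 of 30 passing, 300 yards, 3 TDs, no interceptions.
rating = passer_rating(20, 30, 300, 3, 0)
print(round(rating, 1))
```

The point is the shape of the computation: several raw columns go in, each is normalized and bounded, and one comparable number comes out.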
Instead of thinking of data as something you crunch and analyze days or weeks after it was created, there are opportunities to turn big data into real-time information that provides interested users with updates as soon as they occur. We have a real-time NCAA basketball scoreboard that updates with new scores.
This is one few people consider, but creating content-based applications is a great way to make use of and monetize data. For example, we created StatSmack, which is an app that allows sports fans to discover 10-20+ statistically based “slams” that enable them to talk trash about any team.
A variation on visualizations
Used in the right context, visualizations can be an invaluable tool for understanding a large dataset. The secret is combining bulleted text-based insights with the graphical visualization to allow them to work together to truly inform the user. For example, this page has a chart of win probability over the course of game seven of the 2011 World Series. It shows the ebb and flow of the game.
As more people get their heads around how to crunch and analyze data, the issue of how to effectively communicate insights from that data will be a bigger concern. We are still in the very early stages of this capability, so expect a lot of innovation over the next few years related to automating the conversion of data to content.
Every company needs a mission, a strong vision that rallies the troops toward a common goal. When I started StatSheet back in 2007, my mission was to make sports stats more accessible and visually interesting on the web. We’ve come a long way since then, and it’s time for a new mission.
Here at Automated Insights, we don’t set small goals; we set big, hairy, audacious goals. So here’s our latest: Automate ESPN Digital. That’s right, we intend to automate all aspects of what ESPN does online. It’s a lofty goal, but we believe the time has come.
First, I’d like to give the folks at ESPN a lot of credit. They’ve revolutionized the sports world and have helped elevate sports like no other company before them. They gave us SportsCenter, 30 for 30, and Mel Kiper’s hair. Make no mistake, we are big ESPN fans. However, the time has come for disruptive innovation in the digital sports media space, and we are here to make it happen.
You might be asking yourself, “What does he mean? How do you Automate ESPN?” I took a tour of the ESPN facilities in Bristol a few years ago and was struck by the sheer number of people they were throwing at the problem of covering sports. I was immediately taken by how much of what they did behind the scenes could be automated: everything from reporting to tweeting to video analysis to script writing.
The problem for ESPN is that computers can do many facets of their job faster and better. As Marc Andreessen famously said, “Software is eating the world,” and I firmly believe software will eat ESPN. It’s nibbling right now, but that’s how disruptive innovations start.
In the past 12 months, we’ve accomplished a lot toward this goal:
We generated over a half million sports stories last year through the StatSheet Network, providing broader coverage than ESPN (it should be noted that they still recycle AP content).
We launched the world’s largest sports network of Android and iPhone mobile applications (http://statsheet.com/mobile) with individual automated apps for every team in the NFL, MLB, NCAA Basketball, and NCAA Football.
I’m giving a talk at the MIT Sloan Analytics Conference this year entitled “Automating Bill Simmons,” which will show just how far we think we can take our automated content platform.
And we’re just getting started. We also have some exciting new content applications that we’ll be launching soon that are automated versions of apps ESPN maintains manually as well as applications that would never be possible without automation.
There is still a long way to go. Automating access to all the great video coverage ESPN provides is not an easy feat. In an age where rights fees are exploding, that’s not something a startup can easily disrupt. Or is it? The second screen space might be the Trojan horse that allows smaller players to reach users and grab eyeballs during live sporting events without incurring massive rights fees. 2012 will see a lot of movement in this area and we’ll have some solutions that help make second screen applications more interactive and engaging. (Quick plug: I’m speaking at the upcoming 2nd Screen Summit.)
A couple of weeks ago I gave a talk at the Sloan Sports Analytics Conference titled “Automating Bill Simmons” (which is based on a larger theme of Automating ESPN that I wrote about recently). All of the talks are being posted to YouTube and mine is below. For those who aren’t familiar with what we do related to automating content, and perhaps more importantly why we do it, this will be a good overview.
In 2001, I got an itch to write a book. Like many people, I naïvely thought, “I have a book or two in me,” as if writing a book is as easy as putting pen to paper. It turns out to be very time consuming, and that’s after you’ve spent countless hours learning and researching and organizing your topic of choice. But I marched on and wrote or co-wrote 10 books in a five-year period. I’m a glutton for punishment.
My day job during that time was programming. I’ve been programming for 16 years. My whole career I’ve focused on automating the un-automatable — essentially making computers do things people never thought they could do. By the time I started on my 10th book, I got another kind of itch — I wanted to automate my writing career. I was getting bored with the tedium of writing books, and the money wasn’t that good.
But that’s absurd, right? How can a computer possibly write something coherent and informative, much less entertaining? The “how can a computer possibly do X?” questions are the ones I’ve spent my career trying to answer. So, I set out on a quest to create software that could write. It took more effort than writing 10 books put together, but after building a team of 12 people, we were able to use our software to generate more than 100,000 sports-related stories in a nine-month period.
Before I get into specifics with what our software produces, I think it’s worth highlighting some of the attributes that make software a great candidate to be a writer:
Software doesn’t get writer’s block, and it can work around the clock.
Software can’t unionize or file class-action lawsuits because we don’t pay enough (as many of the content farms have had to deal with).
Software doesn’t get bored and start wondering how to automate itself.
Software can be reprogrammed, refactored and improved — continuously.
Software can benefit from the input of multiple people. This is unlike traditional writing, which tends to be a solitary event (+1 if you count the editor).
Perhaps most importantly, software can access and analyze significantly more data than what a single person (or even a group of people) can do on their own.
Software isn’t a panacea, though. Not all content can be easily automated (yet). The type of content my company, Automated Insights, has automated is quantitatively oriented. That’s the trick. We’ve automated content by applying meaning to numbers, to data. Sports was the first category we tackled. Sports by their nature are very data heavy. By our internal estimates, 70% of all sports-related articles are analyzing numbers in one form or another.
Our technology combines a large database of structured data, a real-time feed of stats, a large database of phrases, and algorithms that tie it all together to produce articles from two to eight paragraphs in length. The algorithms look for interesting patterns in the data to determine what to write about.
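To give a feel for how structured data, a phrase database, and pattern-finding algorithms can fit together, here is a deliberately tiny sketch in Python. It is not our production system. The phrase table, the thresholds (a 20-point margin for a blowout, 25 points for a notable scorer), the field names, and the sample game are all assumptions made up for illustration.

```python
# A toy phrase database: each pattern maps to a sentence template.
PHRASES = {
    "blowout": "{winner} routed {loser} {w_pts}-{l_pts}.",
    "close": "{winner} edged {loser} {w_pts}-{l_pts}.",
    "default": "{winner} beat {loser} {w_pts}-{l_pts}.",
    "big_scorer": "{player} led all scorers with {points} points.",
}


def find_patterns(game):
    """Look for 'interesting' patterns in a box score (hypothetical thresholds)."""
    margin = abs(game["home_pts"] - game["away_pts"])
    if margin >= 20:
        patterns = ["blowout"]
    elif margin <= 3:
        patterns = ["close"]
    else:
        patterns = ["default"]

    # Only mention the top scorer when the performance stands out.
    if game["scorers"]:
        top = max(game["scorers"], key=lambda s: s["points"])
        if top["points"] >= 25:
            patterns.append("big_scorer")
    return patterns


def write_recap(game):
    """Turn one structured box score into a short recap (assumes no ties)."""
    home_won = game["home_pts"] > game["away_pts"]
    facts = {
        "winner": game["home"] if home_won else game["away"],
        "loser": game["away"] if home_won else game["home"],
        "w_pts": max(game["home_pts"], game["away_pts"]),
        "l_pts": min(game["home_pts"], game["away_pts"]),
    }
    if game["scorers"]:
        top = max(game["scorers"], key=lambda s: s["points"])
        facts["player"] = top["name"]
        facts["points"] = top["points"]

    sentences = [PHRASES[p].format(**facts) for p in find_patterns(game)]
    return " ".join(sentences)


# Made-up sample data, not a real game.
sample_game = {
    "home": "Team A",
    "away": "Team B",
    "home_pts": 78,
    "away_pts": 76,
    "scorers": [
        {"name": "J. Doe", "points": 31},
        {"name": "R. Roe", "points": 18},
    ],
}

print(write_recap(sample_game))
```

A real system would need far more patterns, many phrasings per pattern, and historical context, but the division of labor is the same: the data supplies the facts, the algorithms decide what is worth saying, and the phrases decide how to say it.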
In November of 2010, we launched the StatSheet Network, a collection of 345 websites (one for every Division-I NCAA Basketball team) that were fully automated. Check out my favorite team: UNC Tar Heels.
We included the typical kind of stats you’d expect on a basketball site, but also embedded visualizations and our fully automated articles. We automated 14 different types of stories, everything from game recaps and previews to players of the week and historical retrospectives. Recently, we launched similar sites for every MLB team (check out the Detroit Tigers site), and soon we are launching sites for every NFL and NCAA Football team.
Sports is only one of many different categories we are working on. We’ve also done work in finance, real estate and a few other data-intensive industries. However, don’t limit your thinking on what’s possible. We get a steady stream of requests from non-obvious industries, such as pharmaceutical clinical trials and even domain name registrars. Any area that has large datasets where people are trying to derive meaning from the data is a potential candidate for our technology.
Automation plus human, not automation versus human
Creating software that can write long-form narratives is very difficult, full of all sorts of interesting artificial intelligence, machine learning and natural language problems. But with the right mix of talent (and funding), we’ve been able to do it. It really does take a keen understanding of how software and the written word can work together.
I often hear it suggested that software-generated prose must be very bland and stilted. That’s only the case if the folks behind the software write bland and stilted prose. Software can be just as opinionated as any writer.
A common, and funny, question I get from journalists is: “When will you automate me out of a job?” I find the question humorous because built into the question is the assumption that if our software can write the perfect story on a particular topic, then no one else should attempt to write about it. That’s just not going to happen. What’s happening instead is that media companies are using our software to help scale their businesses. Initially, that takes the form of generating stories on topics a media outlet didn’t have the resources to cover. In other cases, it means putting our stories through an editorial process that customizes the content to the specific needs of the publisher. You still need humans for that. There will be less of a need for folks to spend their time writing purely quantitative pieces, but that should be liberating. Now, they can focus on more qualitative, value-added commentary that humans are inherently good at. Quantitative stories can, and probably should, be mostly automated because computers are better at that.
Software will make hyperlocal content possible and even profitable. Many companies have tried to solve the “hyperlocal problem” with minimal success. It’s just too hard to scale content creation out to every town in the U.S. (or the world, for that matter). For certain categories (e.g. high school sports), software-generated content makes perfect sense. You’ll see automated content play a big role here in the coming years.
Because I’ve been so focused on running Automated Insights, I haven’t had time to write any new books recently. I suggested to a colleague that we should turn our software loose and have it write my next book. He looked at me and asked, “How can it possibly do that?” That’s what I like to hear.
But is a software-generated book even feasible? Our software can create eight paragraphs now, but is it possible to create eight chapters’ worth of content? The answer is “yes,” but not quite the same kind of technical books I used to write, at least right now. It would be easy for us to extend our technology to write even longer pieces. That’s not the issue. Our software is good at quantitative analysis using structured data.
The kind of books I used to write were not based on data and were qualitative in nature. I pulled from my experience and did supplemental research, made a judgment on the best way to perform a task, then documented it. We are in the early stages of building software that will do more qualitative analysis such as this, but that’s a much harder challenge. The main advantage of today’s usage of software writing is to automate repetitive types of content. This is less applicable for books.
In the near term, the writers at O’Reilly and elsewhere have nothing to worry about. But I wouldn’t count out automation in the long term.
We wanted to create iPhone apps for all of our sites in the StatSheet Network…all 345 of them. We talked to a bunch of iPhone users and the predominant response we got back was that when they are looking for an app, the first thing they do is go searching for it in the App Store. If you want to find a UNC Basketball app, you’d search on the term UNC Basketball. You don’t go search for a term like “College Basketball” and hope to find your team.
Also, having each team with its own app means we could customize the app to make it very specific to the team. The app icon could be our robot dressed up like the team’s mascot. The colors could be customized, etc. While you can do some of that with one app, it is definitely less than ideal.
So we went about creating 345 apps (or rather StatSheet’s Adam Rawlings did) and started submitting them. We got one approved, but after that a bunch got rejected (this took about 3 weeks). They told us we should use their “In-app purchase” feature to customize the app to a specific team, which obviously has all the drawbacks of a single app. So I escalated through Apple’s app review process. I laid out what I thought to be an articulate summary of why I thought the best user experience was to have an app per team.
Tonight I got a call from someone on the App Review team telling me my request has been denied. In fact, he said they’ve become much more strict about this issue over the last 6 weeks. The fundamental flaw in my logic was that I view the App Store as the primary discovery service for finding apps. This seems like a no-brainer to me. Where else do you go if you are trying to find iPhone apps?? Turns out Apple doesn’t view it that way. The Apple rep told me the App Store is NOT intended to help developers get awareness for their apps. It is simply a mechanism to facilitate the download/purchase of apps. He said they are not trying to be Google in terms of helping people find apps. He said that developers are responsible for driving awareness of their apps, not Apple.
Wow. I was a little blown away. Clearly, Apple thinks differently about how to do things, too. The problem in this case is that their mentality around the App Store is seriously flawed. Sure, if you are ESPN, you can drive awareness of your mobile apps. Heck, I’m sure people go to the App Store to search for “ESPN”. What about less well-known companies, like maybe StatSheet? What sucks about Apple’s policy is now I have to create a single app to support every team, but I can’t even put every team’s name into the app description (which has a character limit) so that it is discoverable!
But what about all the clutter?!?! Having an app per team (or per insect as the Apple rep used in an example) would cause all sorts of “clutter” in the app store. This was a similar argument that I heard when I started submitting Chrome extensions last year for every team. STOP! You are going to clutter up the extension gallery! I’m sorry people, but in this day and age, do you really think putting artificial limits on what should be considered an app is really a smart move? It is better to deal with clutter up-front instead of trying to prevent it from ever happening (because it will anyway!) Focus on providing better search options. Use tagging more effectively. Create an App Rank formula. Just don’t tell me not to upload a few hundred apps because I’m going to clutter your digital store.
So we’ll create a single app, but the folks out there searching for “Duke basketball” or “UNC” won’t find it. But according to Apple, no one would do that kind of thing.
StatSheet is my first startup. I’ve been an advisor to several startups, but this is the first one I’m in charge of. So when I left Cisco and raised money, I thought about how I wanted to keep everyone in the loop with the latest news regarding the company.
In the office we have a daily stand-up meeting so everyone can keep up-to-date with projects. It has worked well. The other group of people who need updates are the investors. Think about it: if you invested tens of thousands or hundreds of thousands of dollars in a company, wouldn’t you want a periodic update?
I figured the easiest way to keep investors in the loop was to send out a weekly email (the StatSheet Statsheet). Apparently, not many entrepreneurs do this. With seed stage companies, there should be plenty of material to include in a weekly or bi-weekly email. At that stage things are changing so much (see my baby analogy) there are new developments just about every week. It only takes 30 minutes to write, and everyone on the distribution appears to appreciate it. It’s a great way to show your investors all the great progress you are making. Remember, the minute you close your first round, you immediately start the unspoken interview process for the second round (assuming you need to raise another round).