To accomplish this goal, it’s helpful to understand the data scientist’s role in big data. Currently, big data is a melting pot of distributed data architectures and tools like Hadoop, NoSQL, Hive and R. In this highly technical environment, data scientists serve as the gatekeepers and mediators between these systems and the people who run the business – the domain experts.
The job is difficult to generalize, but the data scientist serves three main roles: data architecture, machine learning, and analytics. These roles are important, yet not every company needs a highly specialized data team of the sort you’d find at Google or Facebook. The solution lies in creating fit-for-purpose products and solutions that abstract away as much of the technical complexity as possible, so that the power of big data can be put into the hands of business users.
By way of example, think back to the web content management revolution at the turn of the century. Websites were all the rage, but the domain experts were continually banging their heads against the wall – we had an IT bottleneck. Every new piece of content had to be scheduled and sometimes hard-coded by the IT elite. So how was it resolved? We generalized and abstracted the basic needs into web content management systems and made them easy for non-techies to use. As long as you didn’t need anything too crazy, the problem was solved easily, and the bottleneck was removed.
Let’s dig a little deeper into the three main roles of today’s data scientist, using online commerce as a backdrop.
The key to reducing complexity is to limit scope. Nearly every ecommerce business is interested in capturing user behavior – engagements, purchases, offline transactions and social data – and almost every one of them has a catalog and customer profiles.
Limiting scope to this basic functionality would allow us to create templates for the standard data inputs, making both data capture and connecting the pipes much simpler. We’d also need to find meaningful ways to package the different data architectures and tools, which currently include Hadoop, HBase, Hive, Pig, Cassandra and Mahout. These packages should be fit for purpose. It comes down to the 80/20 rule: 80 percent of big data use cases (which is all most ecommerce businesses need) can be achieved with 20 percent of the effort and technology.
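To make the idea of templates for standard data inputs a little more concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the class names, field names, event types and the `load_record` helper are all hypothetical, chosen to mirror the inputs mentioned above (user behavior, a catalog and customer profiles), and are not taken from any particular product.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime
from typing import Any, Dict, Optional

# Hypothetical event types, mirroring the behaviors named in the text.
EVENT_TYPES = {"engagement", "purchase", "offline_transaction", "social"}


@dataclass
class BehaviorEvent:
    """Template for one captured user behavior (assumed fields)."""
    customer_id: str
    event_type: str
    timestamp: datetime
    item_id: Optional[str] = None
    attributes: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject anything outside the limited, agreed scope.
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type!r}")


@dataclass
class CatalogItem:
    """Template for a catalog entry (assumed fields)."""
    item_id: str
    title: str
    price: float


@dataclass
class CustomerProfile:
    """Template for a customer profile (assumed fields)."""
    customer_id: str
    email: Optional[str] = None


def load_record(template: type, raw: Dict[str, Any]) -> Any:
    """Build a template instance from a raw dict, dropping unknown keys."""
    known = {f.name for f in fields(template)}
    return template(**{k: v for k, v in raw.items() if k in known})


if __name__ == "__main__":
    raw_event = {
        "customer_id": "c-1",
        "event_type": "purchase",
        "timestamp": datetime(2013, 1, 1),
        "item_id": "sku-42",
        "tracking_pixel": "ignored",  # not part of the template
    }
    print(load_record(BehaviorEvent, raw_event))
```

The point of the sketch is the limited scope: a business that can describe its data in these few shapes only has to map its sources onto them once, and everything downstream can rely on the same fields.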