We live in the era of the platform economy. All the major hardware and software vendors have now introduced analytics platforms, each designed to solve a particular analytical need. Under these circumstances, one could be forgiven for thinking that machines now understand our data well enough that we don’t have to worry about anything beyond the most superficial aspects of the information we have collected. Sadly, this is not true, and this blog will show you why truly knowing your data is now more important than ever.
On the Nature of Data
Essentially there are now two methods of dealing with your data:
1. ‘Do it yourself’, i.e. have skilled analysts within your organisation use a vendor-supported platform
2. ‘Hand the baby over to the nanny’, i.e. hand your data over to external analysts and deal with all the complications that come with it.
Examples of what’s possible are everywhere. Want to do Machine Learning (ML) without a data scientist? Choose Microsoft’s Azure ML platform. Want to explore cognitive capabilities? IBM’s Watson can help. Need to clean and aggregate your data? Alteryx provides an “intuitive workflow for data cleansing and blending.” Even Experian, known for dictating whether you’ll get your next mortgage, can help you “keep your data accurate and stop wasting your money.” It’s official: Machine Learning as a Service (MLaaS) and Data Science as a Service (DSaaS) have been democratised and are rapidly being commoditised. Eventually, all that will matter is how complicated a process you want to make for yourself and how much you are willing to pay. However, every decision on what to do with your data will depend on what you know about it.
In these posts, I want to empower decision-makers in handling their data science initiatives. So in this blog, I’ll introduce the lynchpin of any successful data science endeavour – the data itself. To do this, I’ll explain five fundamental concepts and define some important terms.
Fundamental Concept 1: Structured or Unstructured
Data essentially falls into two categories:
Structured data – think Excel spreadsheets and relational databases, where data is highly organised and predictable. It’s often human readable, and more readily made machine readable. Search engines, query tools and reporting programs can consume and display this data directly.
Unstructured data – everything else: the text, videos, images and sound files that we clever humans can readily consume. With no predefined schema, it must be converted before machines can understand it. ML algorithms that extract information from images or videos, especially those built on Deep Learning, are still in their infancy. Many Deep Learning tools, like Google’s TensorFlow, debuted only in 2015, and the scientists who can actually use the technology are few and far between.
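To make the distinction concrete, here is a minimal sketch (the field names, values and text are invented for illustration): a structured record with a schema can be queried directly, while unstructured text must first be converted into some machine-readable representation, here a crude bag-of-words.

```python
import csv
import io
from collections import Counter

# Structured data: a CSV with a predefined schema can be queried directly.
structured = io.StringIO("customer_id,region,monthly_spend\n42,EMEA,130.50\n")
row = next(csv.DictReader(structured))
print(row["monthly_spend"])  # the machine already "understands" this field

# Unstructured data: free text has no schema; we must extract features
# (here, simple word counts) before a machine can work with it.
text = "Great service, but the invoice arrived late. Late invoices are frustrating."
words = [w.strip(".,").lower() for w in text.split()]
bag_of_words = Counter(words)
print(bag_of_words["late"])  # prints 2
```

Real pipelines replace the bag-of-words step with far richer feature extraction (embeddings, OCR, speech-to-text), but the principle is the same: unstructured data needs a conversion step before analysis.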
Consider one particular case that came to us at Luxoft. Our client wanted to extract information from a large volume of PDFs containing typed and handwritten text, images, photos and diagrams, and so needed data science tools. But to fulfil their requirements, new deep learning image and text recognition software would have needed to be built from scratch. The client didn’t see the value in this; the cheaper, more pragmatic option was to employ savvy administrators to go through the documents manually, converting the information into a simpler text format so that less complex analytics methods could be applied. While this solution deals with the immediate issue, the volume of PDFs will grow as the business grows. The chosen approach isn’t scalable to their business needs, and the problem will need to be addressed again in time.
So, when considering what data-driven results you want, start thinking about format. You’ll get an idea of what resources you need to take advantage of it. Talk to your solutions provider or data expert. At Luxoft, we help our clients define their business objectives and ensure they are exploiting the right type of data to fulfil their needs.
Fundamental Concept 2: The “Right Data” for the “Right Answers”
You have a business problem to solve, but do you have the data to solve it? Consider the following cases:
1. Trading Surveillance: Analysts are looking for anything resembling fraudulent behaviour – in data science, this is “Anomaly Detection”. To detect anomalies, a dataset must represent what normal behaviour looks like, alongside data points representing anomalous or fraudulent behaviour. And because anomalies are rare, this often means aggregating large, complex datasets to teach the machine learning algorithm what to look for.
2. Forecasting Seasonal Consumption Trends in Energy: While applicable in multiple sectors, I’m focusing on Energy. To predict whether someone is going to consume “_____” energy next week, next season, or next year, you need data that spans back far enough to attempt a prediction, because you need to separate “genuine” trends from “noise”. For example, day-to-day variation in energy usage is irrelevant when considering larger seasonal variability. Even then, you may not have enough data for your prediction to be credible. If a gas company only gathered data on how customers used their heating during one specific year, it effectively has a single instance of each season. Are you sure the consumption of energy is going to be the same next winter? Are all winters really this cold, or are some unseasonably warm?
3. Cross-selling: This problem is everywhere. Selling the right product to the right customer means knowing them: their spending habits, income, where they live and the job they do. But very few organisations have this information at hand, so further research is required – via third-party solutions such as APIs, developing bots, buying information or even running surveys. And often your dataset won’t hold the 360-degree insight you need on your target demographic. So, how will you get it?
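The anomaly detection idea in example 1 can be sketched very simply (the trade values and the 3-sigma threshold here are invented for illustration; real surveillance systems use far richer features and models): given data representing normal behaviour, even a basic statistical rule can flag outliers.

```python
import statistics

# Historical trade values assumed to represent normal behaviour.
normal_trades = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
mean = statistics.mean(normal_trades)
stdev = statistics.stdev(normal_trades)

def is_anomalous(value, threshold=3.0):
    """Flag a trade more than `threshold` standard deviations from normal."""
    return abs(value - mean) > threshold * stdev

# Score incoming trades against the learned notion of "normal".
new_trades = [101, 96, 500]
flags = [is_anomalous(v) for v in new_trades]
print(flags)  # [False, False, True]
```

Note that the baseline is fitted on normal data only – exactly the point made above: without data that characterises normal behaviour, there is nothing to measure an anomaly against.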
To get the ‘right insights’, a scientific mindset and the right domain expertise are necessary. You, as the stakeholder, contribute to solving the problem – after all, you know your business. Scientists and analysts at Luxoft can tell you what is missing once you know what you want, resulting in true collaboration. So, what is it that you want? What can you give us to work with?
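The signal-versus-noise point in example 2 above can also be sketched with invented numbers (the usage figures are hypothetical): within a season, day-to-day variation is small noise, while the gap between seasons is the genuine trend – but one year of data still gives you only one winter to learn from.

```python
import statistics

# Hypothetical daily gas usage (kWh) for a few days in winter and summer.
winter_days = [52, 48, 55, 50, 47, 53]
summer_days = [11, 9, 12, 10, 8, 13]

winter_mean = statistics.mean(winter_days)
summer_mean = statistics.mean(summer_days)

# Day-to-day "noise" within a season...
winter_noise = statistics.stdev(winter_days)
# ...versus the genuine seasonal trend between seasons.
seasonal_gap = winter_mean - summer_mean

print(round(winter_noise, 1))   # ~3.1 kWh of daily noise
print(round(seasonal_gap, 1))   # ~40.3 kWh of seasonal signal
```

The seasonal signal dwarfs the daily noise, so a model should fit the former and ignore the latter – yet with data from a single year, you cannot tell whether this winter was typical or unseasonably cold.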
Fundamental Concept 3: Aggregation and Infrastructure
Once an organisation figures out what it wants to solve, it’s time to gather the desired data. At this point, a veritable Pandora’s Box of issues opens: privacy, accessibility, governance and compliance all need consideration. Keep these questions in mind:
1. Do you have the right architecture? This is governed by the nature of your data and what you want to do with it.
2. Do you want to keep your data on premises, scale up your operations and work from the cloud, or run a hybrid?
3. Where is your data stored right now, and how can you get to it?
4. Are you considering overhauling everything, or do you aim to take advantage of new technologies without compromising your legacy systems?
5. What type of analytics is required – batch processing, near-real-time, or real-time?
All these questions need to be explored before your ML algorithm goes into large-scale production. But they are not readily answered by any chosen platform; to get the answers, you’ll still need help from a data expert.
Fundamental Concept 4: Veracity
When people talk about Big Data and the four Vs – volume, variety, velocity and veracity – the last one doesn’t get much airtime. With large amounts of data arriving quickly from multiple sources, the speed of your analytics may not be an issue. Even so, you need to know you can trust your data. So, can you?
Error propagation is a fundamental concern for any scientist. Your predictions and outputs are affected by the compounding of errors within your data pipeline, and a good data scientist can help qualify and quantify these error sources. At Luxoft, we help manage your expectations so you can make informed business decisions supported by sound data interpretation. By helping identify sources of error and how to quantify them, our analytics experts vastly improve the validity of your insights.
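To make the compounding of errors concrete, here is a minimal sketch (the stage uncertainties are invented): for independent error sources, a standard first-order approximation is to add the relative uncertainties of each pipeline stage in quadrature to estimate the uncertainty of the final output.

```python
import math

# Relative (fractional) uncertainties introduced at each pipeline stage,
# e.g. 2% from sensor noise, 5% from an imputation step, 3% from aggregation.
stage_errors = [0.02, 0.05, 0.03]

# For independent errors, relative uncertainties add in quadrature.
combined = math.sqrt(sum(e ** 2 for e in stage_errors))
print(f"{combined:.1%}")  # 6.2%
```

Note the combined uncertainty (about 6.2%) is larger than any single stage’s error but smaller than the naive 10% sum – and correlated errors can compound even more aggressively, which is exactly why quantifying each source matters.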
Fundamental Concept 5: “Resistance is futile!”
The most challenging issue I come across in Data Science is not the data, nor the science – it’s people. Managing the data, corresponding technology and analytics is easy. But often the bottleneck to implementation is human resistance. It can be cultural, where we contest the ‘we’ve always done it like that’ attitude. It can also be a fear of risk, since it could result in losing time and money. It can be a concern on how “new-fangled” machines will affect the workflow, aka “will people lose jobs?” And for many organisations, it can be a concern over “how can we do this and remain compliant and safe?”
The truth is, you shouldn’t worry about these things, as long as your solutions provider communicates effectively and manages your expectations. The availability of experienced experts and engineers across multiple sectors at Luxoft is reassuring for our clients. We also keep the channels of communication open, within our organisation and with our customers. Combined with our passion for what we do, this makes for an easy relationship with our clients and a smooth process.
Just like the previous blog, I’ve probably set you up with more questions than answers. But these are takeaway questions you can start thinking about without paying a fortune for a consultant to consider this as part of their billable hours.
It’s more important now than ever to know what you want to do with your data. Platforms are everywhere, and they enable you to create solutions. But they are tools: they won’t guide or advise you – that still requires empowering yourself. You need to know who and what to ask. Knowing your data as well as you know your business puts you on the path to success. Keep asking questions, keep striving to learn, and if you need somewhere to start, we’d be happy to help. Feel free to comment on the blog below.
In the next post, I will talk about how technology and tooling relates to the data science process.