Big Data: Opportunities and challenges



Overview

Despite the hype around big data, there is no denying that its potential to benefit organisations, businesses and customers is enormous. The articles in this ebook aim to give practical guidance for all those who want to understand big data better and learn how to make the most of it. Topics range from big data analysis, mobile big data and managing unstructured data to technologies, governance and intellectual property and security issues surrounding big data.

Product Details

ISBN-13: 9781780172637
Publisher: BCS, The Chartered Institute for IT
Publication date: 04/07/2014
Sold by: Barnes & Noble
Format: eBook
Pages: 60
File size: 742 KB

Read an Excerpt

CHAPTER 1

WHERE ARE WE WITH BIG DATA?

Brian Runciman, Head of Editorial and Website Services at BCS, The Chartered Institute for IT, looks at what big data is all about.

INTRODUCTION

There have been many descriptions of big data of late – mostly metaphors or similes for 'big' (deluge, flood, explosion) – and not only is there a lot of talk about big data, there is also a lot of data. But what can we do with structured and unstructured data? Can we extract insights from it? Or is 'big data' just a marketing puff term?

There is absolutely no question that there is an awful lot more data around now than there was only a few years ago. IBM say that 'every day we create 2.5 quintillion bytes of data – so much that 90 per cent of the data in the world today has been created in the last two years alone'.

SOURCES

Social media platforms produce huge quantities of data, both from individual network profiles and the content that influencers and the less influential alike produce. Short form blogging, link-sharing, expert blog comments, user forums, 'likes' and more all contain potentially useful information.

There is also data produced through sheer activity, for example machine-generated content in the form of device log files, which could be characterised as the 'internet of things'. This would include output from such things as geo-tagging.

Yet more data can be mined from software-as-a-service and cloud applications – data that's already in the cloud but mostly divorced from internal enterprise data. Another large, but at this stage largely untapped, area is the data languishing in legacy systems, which include things like medical records and customer correspondence.

CAVEATS

A post from BCS's future blogger called into question some of the behind-the-scenes story: 'For the big data commercial advocates, there must be algorithms that can trawl the data and create outcomes better, that is to say more cost effectively, than traditional advertising. Where is the evidence that such algorithms exist? How will these algorithms be created and evaluated and improved upon if they do exist? One problem is that in a huge data set, there may be many spurious correlations, and the difference between causation and correlation is hard to prove.'

As we would perhaps expect, the likes of IBM say that big data goes beyond hype: 'While there is a lot of buzz about big data in the market, it isn't hype. Plenty of customers are seeing tangible ROI using IBM solutions to address their big data challenges.'

Big Blue go on to quote a 20 per cent decrease in patient mortality by analysing streaming patient data in the health care arena; a telco that enjoyed a 92 per cent decrease in processing time by analysing networking and call data; and a whopping 99 per cent improved accuracy in placing power generation resources by analysing 2.8 petabytes of untapped data for a utilities organisation.

TOOLS

In times gone by, enterprises handled large data sets with relational databases and warehouses from proprietary suppliers. However, these simply cannot handle the volumes of data now being produced. This has seen a trend towards open source alternatives such as Hadoop, which Wikipedia defines as 'an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware.'
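For a flavour of the programming model Hadoop popularised, here is a minimal MapReduce-style word count written as a Hadoop Streaming mapper and reducer pair in Python. It is a sketch only: the file names and the input are illustrative assumptions, not anything prescribed by the book or by Hadoop itself.

#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming mapper: reads raw text on stdin,
# emits one "word<TAB>1" line per word on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print(word + "\t1")

#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming reducer: receives the mapper output
# sorted by key and sums the counts for each distinct word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

Locally the pair can be checked with a pipeline such as cat some_text.txt | python3 mapper.py | sort | python3 reducer.py; on a cluster the same two scripts are handed to Hadoop Streaming, which splits the input, sorts between the two phases and distributes the work across the commodity machines the definition above refers to.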

Wired recently reported on Cloudera – one of several companies that help build and use Hadoop applications – which is offering a Google-style search engine for Hadoop called, uninspiringly, Cloudera Search. Interestingly, Wired pointed to a recent Microsoft paper on whether customers really need to put all their data in Hadoop. It argued that 'most companies don't (have) data problems that justify the use of big clusters of servers. Even Yahoo and Facebook, two of the companies most associated with big data, are using clusters to solve problems that could actually be done on a single server.'

Despite that, interest is on the up and big organisations are taking advantage. A recent piece from The Sun Daily mentions that 'analyst firm International Data Corp projects the global big data technology and services market will grow at a compound annual growth rate of 31.7 per cent – about seven times the rate of the overall information and communications technology market'.

The same article reports further investment in the perceived future of big data with announcements by Dell, Intel Corporation and Revolution Analytics of the Big Data Innovation Centre in Singapore. The new centre brings together expertise from all three organisations to provide training programmes, proof-of-concept capabilities and solution development support on big data and predictive analytic innovations catering to the Asian market.

HOW AND WHEN

The 'when' of embracing any new technology is massively variable, depending on your organisation's aims, business sector and so on. Some of the things that could affect your timing are neatly summed up by Redmond magazine in a recent article, simply by listing some of the possible motivators. They mention, for instance, linking 'CRM [customer relationship management] systems and data feeds to tweets mentioning their organisations that can alert them to a sudden problem with a product'. If this kind of real-time feedback is of benefit, then dipping a toe into the big data waters is best done sooner rather than later.

Another area mentioned is 'potential market opportunities spawned by an event' – not as business-critical as product feedback, but important in a time of global austerity. Redmond magazine also mentions things such as online and big-box retailers using big data to automate their supply chains on the fly and law enforcement agencies analysing huge amounts of data to thwart potential crime and terror attacks. The scope and motivations vary widely, but potential benefits are both long and short-term.

As to how to go about it, some of the tools are mentioned above, often oriented around Hadoop. Microsoft recently launched Windows Azure HDInsight and Redmond magazine also cited VMware's key application infrastructure and big data and analytics portfolio called Pivotal.

There's plenty to read about, as the following list shows.

CHAPTER 2

BIG DATA TECHNOLOGIES

Keith Gordon MBCS CITP, former Secretary of BCS Data Management Specialist Group and author of Principles of Data Management, looks at definitions of big data and the database models that have grown up around it.

Whether you live in an 'IT bubble' or not, it is very difficult nowadays to miss hearing of something called 'big data'. Many of the emails hitting my inbox go further and talk about 'big data technologies'. These fall into two camps: the technologies to store the data and the technologies required to analyse and make sense of the data.

So, what is big data? In an attempt to find out I attended a seminar put on by The Institution of Engineering and Technology (IET) in 2012. After listening to five speakers I was even more confused than I had been at the beginning of the day. Amongst the interpretations of the term 'big data' I heard on that day were:

• Making the vast quantities of data held by the government publicly available – the 'Open Data' initiative. I am really not sure what 'big' means in this scenario!

• For a future project, storing large quantities of highly structured data of limited complexity in a 'hostile' environment with no readily available power supply, and then analysing it in slow time. Here 'big' means 'a lot of'.

• For a telecoms company, analysing data about a person's previous web searches and tying that together with the person's current location so that, for instance, if their searches have indicated they like Chinese food, they can be pinged with an advert for a nearby Chinese restaurant before they have walked past it. Here 'big' principally means 'very fast'.

• Trying to gain business intelligence from the mass of unstructured or semi-structured data an organisation has in its documents, emails and so on. Here 'big' equates to 'complex'.

So, although there is no commonly accepted definition of big data, we can say that it is data that can be defined by some combination of the following five characteristics:

Volume – Where the amount of data to be stored and analysed is large enough to require special considerations.

Variety – Where the data consists of multiple types of data, potentially from multiple sources. Here we need to consider: structured data, held in tables or objects, for which the metadata is well defined; semi-structured data, held as documents or similar, where the metadata is contained internally (for example XML documents); and unstructured data, which can be photographs, video or any other form of binary data. (A short sketch after this list illustrates the three forms.)

Velocity – Where the data is produced at high rates and operating on 'stale' data is not valuable.

Value – Where the data has perceived or quantifiable benefit to the enterprise or organisation using it.

Veracity – Where the correctness of the data can be assessed.
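To make the Variety point concrete, the sketch below shows the same customer record in the three forms described above: a structured row whose metadata lives outside the data, a semi-structured XML fragment that carries its metadata internally, and an unstructured binary object with no internal schema. All of the data is invented for illustration.

# Illustrative only: the same customer represented three ways (all data invented).
import xml.etree.ElementTree as ET

# Structured: a fixed schema; the metadata (what each field means) lives outside the data
structured_row = ("C042", "Ada Lovelace", "London")   # (id, name, city)

# Semi-structured: the metadata travels inside the document itself
xml_doc = ET.fromstring(
    "<customer id='C042'><name>Ada Lovelace</name><city>London</city></customer>"
)
print(xml_doc.get("id"), xml_doc.findtext("name"))

# Unstructured: raw binary content (say, a scanned letter) with no internal schema at all
scanned_letter = b"\xff\xd8\xff\xe0"   # e.g. the opening bytes of a JPEG image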

Interestingly, I saw an article from The New York Times about a group that works for the council in New York. It was faced with the problem of finding the culprits who were polluting the sewers with old cooking fats. One department had details of where the sewers ran and where they were getting blocked, another department had maps of the city with details of all the restaurants and a third department had details of which restaurants had contracts with disposal companies for the removal of old cooking fats.

Putting this information together produced details of the restaurants that did not have disposal contracts, were close to the blockages and were, therefore, possible culprits. That was described as an application of big data, but there was no mention of any specific big data technologies. Was it just an application of common sense and good detective work?
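The reasoning in that example is essentially a join across the three departments' datasets. A minimal sketch of the same logic in plain Python, with entirely invented data, looks like this:

# Invented data standing in for the three departments' records
blocked_sewer_zones = {"zone_3", "zone_7"}                      # where blockages occur
restaurants = {"Wok This Way": "zone_3",
               "Pasta Palace": "zone_7",
               "Chez Nous": "zone_1"}                           # restaurant -> zone
has_disposal_contract = {"Pasta Palace"}                        # contracted for fat removal

# Restaurants near a blockage with no disposal contract are the likely culprits
suspects = [name for name, zone in restaurants.items()
            if zone in blocked_sewer_zones and name not in has_disposal_contract]
print(suspects)   # ['Wok This Way']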

THE TECHNOLOGIES

More recently, following the revelations from Edward Snowden, the American whistleblower, The Washington Post had an article explaining how the National Security Agency is able to store and analyse the massive quantities of data it is collecting about the telephone, text and online conversations that are going on around the world. This was put down to the arrival, within the last few years, of big data technologies.

However, it is not just government agencies that are interested in big data. Large data-intensive companies, such as Amazon and Google, are taking the lead in some of the developments of the technologies to handle big data.

Our beloved SQL databases, based on the relational model of data, do not scale easily to handle the growing quantities of structured data and have only limited facilities for handling semi-structured and unstructured data. There is, therefore, a need for alternative storage models for data.

Collectively, databases built around these alternative storage models have become known as NoSQL databases, where this can mean 'Not Only SQL' or 'No, Never SQL' depending on the alternative storage model being considered (or, indeed, your perception of SQL as a database language).

There are over 150 different NoSQL databases available on the market. They all achieve performance gains by doing away with some (or all) of the restrictions traditionally associated with conventional databases in exchange for scalability and distributed processing. The principal categories of NoSQL databases are key-value stores, document stores, extensible record (or wide-column) stores and graph databases, although there are many other types of NoSQL databases.

A key-value store is where the data can be stored in a schema-less way, with the 'key-value' relationship consisting of a key, normally a string, and a value, which is the actual data of interest. The value itself can be stored using a datatype of a programming language or as an object.
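As a minimal sketch of the idea (an in-memory toy in Python, not any particular product), a key-value store exposes little more than put, get and delete on opaque values:

import json

class ToyKeyValueStore:
    """Minimal in-memory illustration: the store neither knows nor cares
    what the value contains, which is what 'schema-less' means here."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

store = ToyKeyValueStore()
# The value can be a serialised object, a counter, an image -- anything
store.put("session:42", json.dumps({"user": "ada", "basket": ["book"]}))
print(store.get("session:42"))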

A document store is a key-value store where the values are native documents, such as Microsoft Office (MS Word, MS Excel and so on), PDF, XML or similar documents. Whilst every row in a table in an SQL database will have the same sequence of columns, each document could have data items that are completely different.
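A sketch of the same idea for a document store, again as a toy in plain Python with invented data: each document carries its own fields, so two records in the same collection need not share any columns.

# A 'collection' of documents: note the two records share almost no fields
invoices = [
    {"_id": 1, "type": "invoice", "customer": "Acme", "total": 120.00,
     "lines": [{"item": "widget", "qty": 3}]},
    {"_id": 2, "type": "scanned_pdf", "filename": "inv_0093.pdf",
     "uploaded_by": "ada"},
]

# Query by a field that only some documents possess
large_invoices = [doc for doc in invoices if doc.get("total", 0) > 100]
print([doc["_id"] for doc in large_invoices])   # [1]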

Like SQL databases, extensible record stores (or wide-column stores) have 'tables' (called 'super column families'), which contain columns (called 'super columns'). However, each of the columns contains a mix of 'attributes', similar to key-value stores. Some of the best-known NoSQL databases, such as HBase (part of the Hadoop ecosystem) and Cassandra, are extensible record stores.
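A rough sketch of the wide-column data model, with invented data and modelled only loosely on stores of this kind: rows are keyed, columns are grouped into families, and each row may populate a different set of columns.

# row key -> column family -> {column qualifier: value}
users_table = {
    "user#ada": {
        "profile":  {"name": "Ada", "city": "London"},
        "activity": {"2014-04-01": "login", "2014-04-02": "purchase"},
    },
    "user#bob": {
        "profile":  {"name": "Bob"},                  # fewer columns -- that's fine
        "activity": {"2014-04-03": "login"},
    },
}

# Reads address a (row key, family, qualifier) coordinate
print(users_table["user#ada"]["profile"]["city"])     # London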

Graph databases consist of interconnected elements with an undetermined number of interconnections and are used to store data representing concepts such as social relationships, public transport links, road maps or network topologies.
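A minimal sketch of the graph model, with toy data: nodes and edges held as an adjacency list, and a breadth-first traversal standing in for the path queries a graph database is designed to answer efficiently.

from collections import deque

# Social relationships as an adjacency list: person -> set of friends
graph = {
    "Ada":   {"Brian", "Carol"},
    "Brian": {"Ada", "Dev"},
    "Carol": {"Ada"},
    "Dev":   {"Brian"},
}

def degrees_of_separation(start, target):
    """Breadth-first search returning the number of hops between two people."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        if person == target:
            return hops
        for friend in graph[person] - seen:
            seen.add(friend)
            queue.append((friend, hops + 1))
    return None

print(degrees_of_separation("Carol", "Dev"))   # 3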

Storing the data is, of course, just part of the story. For the data to be of use it must be analysed, and for this a whole new range of sophisticated techniques is required, including machine learning, natural language processing, predictive modelling, neural networks and social network mapping. Sitting alongside these techniques is a complementary range of data visualisation tools.
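As a very small illustration of predictive modelling, the sketch below fits a simple classifier with scikit-learn (an assumed, commonly used Python library, not one named in the text) on invented customer data; it is a toy, not a recipe.

# Minimal predictive-modelling sketch with scikit-learn (library assumed installed).
# Invented data: predict whether a customer churns from two simple features.
from sklearn.linear_model import LogisticRegression

X = [[12, 0], [45, 3], [3, 1], [60, 5], [8, 0], [52, 4]]   # [months_active, complaints]
y = [0, 1, 0, 1, 0, 1]                                     # 1 = churned

model = LogisticRegression().fit(X, y)
print(model.predict([[40, 2]]))    # e.g. [1] -- flagged as likely to churn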

Big data has always been with us, whether you consider it to be a volume issue, a variety issue, a velocity issue, a value issue or a veracity issue, or a combination of any of these. What is different is that we now have the technologies to store and analyse large quantities of structured, semi-structured and unstructured data.

For some this is technically challenging. Others see the emergence of big data technologies as a threat and the arrival of the true big brother society.

CHAPTER 3

BIG DATA = BIG GOVERNANCE?

Adam Davison MBCS CITP asks whether big data means big governance.

For the average undergraduate student in the 1980s, attempting to research a topic was a time-consuming and often frustrating experience. Some original research and data collection might have been possible, but to a great extent research consisted of visits to a library to trawl through textbooks and periodicals.

Today the situation is very different. Huge volumes of data from which useful information can be derived are readily available – both in structured and in unstructured formats – and that volume is growing exponentially. Researchers have many options. They can still generate their own data, but they can also obtain original data from other sources or draw on the analysis of others. Most powerfully of all, they can combine these approaches, allowing the examination of correlations and differences. In addition to all this, researchers have powerful tools and technologies to analyse this data and present the results.

In the world of work the situation is similar, with huge potential for organisations to make truly informed management decisions. The days of 'seat of the pants' management are generally believed to be on the way out, with future success for most organisations driven by two factors: what data you have or can obtain, and how you use it.

However, in all this excitement, there is an aspect that is easy to overlook: governance. What structures and processes should organisations put in place to ensure that they can realise all these possibilities?

Equally importantly, how can the minefield of potential traps waiting to ensnare the unwary be avoided? Can organisations continue to address this area in the way they always have or is a whole new approach to governance needed in this new world of big data?

What is clear is that big data presents numerous challenges to the organisation, which can only be addressed by robust governance. Most of these challenges aren't entirely new, but the increasing emphasis on data and data modelling as the main driver of organisational decisions and competitive advantage means that getting the governance right is likely to become far more important than has been the case in the past.

QUESTIONS, QUESTIONS

To start with there is the question of the overall organisational vision for big data and who has the responsibility of setting it. What projects will be carried out with what priority? Also one has to consider practicalities – how will the management of organisational data be optimised?

Next we come to the critical question of quality. Garbage in, garbage out is an old adage and IT departments have been running data cleansing initiatives since time immemorial. Yet in the world of big data, is this enough? What about the role of the wider organisation, the people who really get the benefit from having good quality data?

(Continues…)


Excerpted from "Big Data"
Copyright © 2014 BCS Learning & Development Ltd.
Excerpted by permission of BCS The Chartered Institute for IT.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.

Table of Contents

Preface
Where are we with big data?
Big data technologies 
Big data = big governance?
Maximising on big data
Mobility and big data – an interesting fusion
Big data analysis
Removing the obstacles to big data analytics
Managing unstructured data
Big data – risky business
Securing big data
Data, growth and innovation
The new architecture
Intellectual property in the era of big and open data
Big data, big hats
The commercial value of big data
Big data, big opportunities