The World of “Big Data”

Dipaditya Das

To understand ‘Big Data’, you first need to know

What is data?

The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

What is Big Data?

Big Data is a bundle of problems that arise from very large volumes of data. It was originally defined by the “3Vs”, but there are now “5Vs” of Big Data. These characteristics are also the problems we face when handling Big Data. They are as follows:

1. Volume:

Volume refers to the sheer amount of data. Size plays a crucial role in determining the value of data, and whether a particular data set counts as Big Data depends on its volume. Hence ‘Volume’ is the first characteristic to consider when dealing with Big Data.

2. Velocity:

Velocity refers to the high speed at which data accumulates. Data flows in from sources like machines, networks, social media, and mobile phones in a massive and continuous stream. Velocity determines the potential of the data: how fast it is generated and how fast it must be processed to meet demand.

You can watch data being generated in real time on the InternetLiveStats website.

3. Variety:

Variety refers to the nature of the data, which can be structured, semi-structured, or unstructured, and to the heterogeneous sources it comes from. Data now arrives from new sources both inside and outside the enterprise.

  • Structured data: This is organized data. It generally refers to data with a defined length and format.
  • Semi-structured data: This is semi-organized data. It does not conform to a formal data structure, but it still carries some markers that separate its elements. Log files are examples of this type of data.
  • Unstructured data: This is unorganized data. It generally refers to data that doesn’t fit neatly into the traditional row and column structure of a relational database. Texts, pictures, and videos are examples of unstructured data that can’t be stored in the form of rows and columns.
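
A small Python sketch may make the three categories more concrete. The sample records are invented purely for illustration and do not come from any real system.

```python
import csv
import io
import json

# Structured: fixed columns, so every row parses the same way.
structured = "id,name,age\n1,Asha,31\n2,Ravi,27\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: each log entry describes itself, but the fields vary.
semi_structured = [
    '{"level": "INFO", "msg": "login", "user": "asha"}',
    '{"level": "ERROR", "msg": "timeout", "retry": 3}',
]
events = [json.loads(line) for line in semi_structured]

# Unstructured: free text with no fields at all; we can only scan it.
unstructured = "Loved the new phone, but the battery drains too fast."
word_count = len(unstructured.split())

print(rows[0]["name"])           # a named column is always present
print(sorted(events[1].keys()))  # keys differ from one entry to the next
print(word_count)
```

The structured rows can be queried by column name straight away, the log entries have to be inspected one by one to see which keys they carry, and the free text offers nothing to query without further processing.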

4. Veracity:

Veracity refers to inconsistencies and uncertainty in data: the available data can get messy, and its quality and accuracy are difficult to control. Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.

5. Value:

Once the 4Vs are taken into account, there is one more V, which stands for Value. A bulk of data with no value is of no use to a company unless it is turned into something useful. Data in itself has little importance; it needs to be converted into valuable information. Hence, you can argue that Value is the most important of the 5Vs.

Before you move on, have a look at these amazing stats:

🔹 International Data Corp. says by 2020, each person on earth will generate an average of about 1.7 MB of data per second.

🔹 IBM says worldwide, people are already generating 2.5 quintillion bytes of data each day.

🔹 International Data Corp. says advanced data analytics show that machine-generated data will grow to encompass more than 40% of internet data in 2020.

🔹 VisualCue says stored data will grow to 44 ZB by 2020.

🔹 BaseLine says nearly 90% of all data has been created in the last two years.
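
To put two of these figures into more familiar units, here is a rough back-of-the-envelope conversion. It assumes decimal (SI) units and takes the quoted numbers at face value, so treat the results as orders of magnitude rather than measurements.

```python
BYTES_PER_DAY = 2.5e18           # 2.5 quintillion bytes per day (IBM figure)
EXABYTE = 10**18                 # decimal (SI) exabyte

exabytes_per_day = BYTES_PER_DAY / EXABYTE

SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
MB_PER_SECOND_PER_PERSON = 1.7   # IDC projection quoted above
gb_per_person_per_day = MB_PER_SECOND_PER_PERSON * SECONDS_PER_DAY / 1000

print(f"{exabytes_per_day:.1f} EB generated worldwide per day")
print(f"{gb_per_person_per_day:.2f} GB per person per day")
```

In other words, 2.5 quintillion bytes is 2.5 exabytes, and 1.7 MB per second works out to roughly 147 GB per person per day.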

You might be wondering how such a huge amount of data is produced and why it causes problems.

People are generating 2.5 quintillion bytes of data each day, a truly mind-boggling amount. We are constantly producing data; even our kitchen appliances are hooked up to the internet, sharing and storing mountains of data. The amount of information being collected around the globe is too hefty to process with traditional tools. I have gathered some stats that will help illustrate some of the ways we create these colossal amounts of data every single day.

Internet

With so much information at our fingertips, we’re adding to the data stockpile every time we turn to our search engines for answers.

  • We conduct more than half of our web searches from a mobile phone now.
  • More than 3.7 billion humans use the internet (that’s a growth rate of 7.5 percent over 2016).
  • On average, Google now processes more than 40,000 searches EVERY second (3.5 billion searches per day)!
  • While 77% of searches are conducted on Google, it would be remiss not to remember other search engines are also contributing to our daily data generation. Worldwide there are 5 billion searches a day.
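
As a quick sanity check on the search numbers, the per-second figure can be converted to a daily total. This is only arithmetic on the rounded figures quoted above, not a new data source.

```python
SECONDS_PER_DAY = 24 * 60 * 60           # 86,400

searches_per_second = 40_000             # "more than 40,000", so a lower bound
google_per_day = searches_per_second * SECONDS_PER_DAY

worldwide_per_day = 5_000_000_000        # 5 billion searches a day, all engines
google_share = google_per_day / worldwide_per_day

print(f"{google_per_day:,} Google searches per day")
print(f"{google_share:.1%} of the worldwide total")
```

40,000 searches a second comes to about 3.46 billion a day, which matches the rounded 3.5 billion. That is about 69% of 5 billion, somewhat below the quoted 77% share. The gap is not surprising, since “more than 40,000” is a floor and the figures are rounded.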

Social Media

Our current love affair with social media certainly fuels data creation. According to Domo’s Data Never Sleeps 5.0 report, the first five numbers below are generated every minute of the day; the rest are usage figures reported by or about the platforms themselves:

  • Snapchat users share 527,760 photos
  • More than 120 professionals join LinkedIn
  • Users watch 4,146,600 YouTube videos
  • 456,000 tweets are sent on Twitter
  • Instagram users post 46,740 photos
  • As of June 2019, Facebook reports an estimated 2.4 billion Monthly Active Users.
  • Facebook also says it has 1.6 billion Daily Active Users.
  • 88% of Facebook’s user activity is from a mobile device.
  • The average amount of time a single user spends on Facebook every day is 58 minutes.
  • There are over 300 million photos uploaded to Facebook every day.
  • On average, 5 Facebook accounts are created every second.
  • Approximately 30% of Facebook users are aged between 25 and 34 years.
  • Facebook video is still in high demand with approximately 8 billion video views per day.
  • Currently, YouTube has more than 1.9 billion logged-in visits every month.
  • 149 million people log in to YouTube daily.
  • The average duration of a YouTube visit is 40 minutes.
  • Viewers are spending an average of 1 hour per day watching YouTube videos.
  • On average, 300 hours of video are uploaded every minute on YouTube.
  • There are over 5 billion video views each day.
  • Instagram has over 1 billion monthly active users.
  • There are more than 600 million daily active users.
  • There are now 500 million daily Stories users.
  • Since its creation, more than 40 billion photos have been shared.
  • On average, 95 million photos are uploaded daily on Instagram.
  • There are approximately 4.2 billion likes per day.
  • Most Instagram users are between 18 and 29 years of age, with 32% of Instagram users being college students.
  • Snapchat has approximately 301 million monthly active users.
  • Snapchat also reports 109 million daily active users (a downward trend).
  • Of those daily active users, 77 million are from the United States.
  • 60% of these Snapchat users are aged between 18 and 34 years.
  • Snapchat is competing closely with its rival, Facebook, by reporting more than 10 billion video views daily.
  • Approximately 3 billion snaps are created every day.
  • Twitter now has more than 330 million monthly active users.
  • Twitter reports 134 million “monetizable” daily active users (mDAU).
  • Of its monthly active users, 68 million are from the United States.
  • The number of mDAU from the US is 26 million.
  • Close to 460,000 new Twitter accounts are registered every day.
  • Twitter users post 140 million tweets daily, which adds up to about a billion tweets a week.
  • Each Twitter user has on average 208 followers.
  • 550 million accounts are reported to have sent at least one tweet.

Communication

We leave a data trail whenever we use our favorite communication methods, from texts to emails. Here are some incredible stats for the volume of communication we send out every minute:

  • We send 16 million text messages
  • There are 990,000 Tinder swipes
  • 156 million emails are sent; worldwide it is expected that there will be 2.9 billion email users by 2019
  • 15,000 GIFs are sent via Facebook Messenger
  • Every minute there are 103,447,520 spam emails sent
  • There are 154,200 calls on Skype

Digital Photos

Now that our smartphones are excellent cameras as well, everyone is a photographer, and the trillions of photos stored are the proof. Since there are no signs of this slowing down, expect these digital photo numbers to continue to grow:

  • People will take 1.2 trillion photos by the end of 2017
  • There will be 4.7 trillion photos stored

Services

There are some really interesting statistics coming out of businesses and service providers in our new platform-driven economy. Here are just a few numbers that are generated every minute that piqued my interest:

  • The Weather Channel receives 18,055,556 forecast requests
  • Venmo processes $51,892 in peer-to-peer transactions
  • Spotify adds 13 new songs
  • Uber riders take 45,788 trips!
  • There are 600 new page edits to Wikipedia

Internet of Things

The Internet of Things, connected “smart” devices that interact with each other and us while collecting all kinds of data, is exploding (from 2 billion devices in 2006 to a projected 200 billion by 2020) and is one of the primary drivers for our data vaults exploding as well.

Let’s take a look at just some of the stats and predictions for just one type of device, voice search:

  • There are 33 million voice-first devices in circulation
  • 8 million people use voice control each month
  • Voice search queries in Google for 2016 were up 35 times over 2008

By now you should have an idea of where such huge volumes of data come from. There are hundreds of companies like Facebook, Twitter, and LinkedIn generating enormous amounts of data.

Why store such huge data?

To gain competitive advantage, organizations have to make the best use of the unstructured data collected for profitable business decision making. This situation where companies and institutions have to support, store, analyze, and make decisions using large amounts of data is called Big Data.

Big Data Analytics is the process of analyzing large structured and unstructured data sets to discover unknown relations, hidden patterns, and any other valuable information that can be leveraged for better business decision making. It tackles even the most challenging business problems through high-performance analytics. Big Data Analytics drives innovation by helping organizations make the best possible decisions through high-performance data mining, predictive analytics, text mining, social sentiment analysis, forecasting, and optimization. In addition, organizations are realizing that the distinct properties of deep learning and machine learning are well suited to address their requirements in novel ways through Big Data Analytics.

Who helps to store and process large datasets?

Apache Hadoop is an open-source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.

In simple words, Hadoop helps us apply the concept of distributed storage, where Big Data is split into smaller blocks before it is stored. This lets us store and retrieve data at high speed without a single point of failure, because the blocks are stored on different nodes.
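
The idea of splitting data into blocks and spreading copies across nodes can be sketched in a few lines of Python. This is a toy model, not Hadoop’s actual code: the node names, the tiny block size, and the round-robin placement are all assumptions made for illustration. Real HDFS uses much larger blocks (128 MB by default in recent versions), keeps three replicas by default, and places them with rack awareness.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the data into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list, replication: int) -> dict:
    """Toy round-robin placement: block b goes to `replication` consecutive nodes."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


data = b"x" * 1000                                   # stand-in for a large file
blocks = split_into_blocks(data, block_size=256)
nodes = ["node-a", "node-b", "node-c", "node-d"]     # hypothetical cluster
placement = place_blocks(len(blocks), nodes, replication=3)

# Simulate losing one node: every block should still have a live copy.
failed = "node-b"
survivors = {
    b: [n for n in holders if n != failed]
    for b, holders in placement.items()
}

print(len(blocks), "blocks")
print(placement)
print(all(survivors.values()))   # True when no block has lost all of its copies
```

Because each block lives on three of the four nodes, taking any single node offline still leaves at least two copies of every block, which is the “no single point of failure” property described above.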

Hadoop is built to run on a cluster of machines.

Let’s start with an example. Let’s say that we need to store lots of photos. We will start with a single disk. When we exceed a single disk, we may use a few disks stacked on a machine. When we max out all the disks on a single machine, we need to get a bunch of machines, each with a bunch of disks. This is exactly how Hadoop is built. Hadoop is designed to run on a cluster of machines from the get-go.

Hadoop clusters scale horizontally

More storage and compute power can be achieved by adding more nodes to a Hadoop cluster. This eliminates the need to buy more and more powerful and expensive hardware.
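
A rough capacity estimate shows what horizontal scaling means in practice. The function below is a hypothetical helper, and the disk sizes are made-up numbers. It assumes every block is stored three times (HDFS’s default replication factor) and ignores the headroom a real cluster needs for temporary data and growth.

```python
import math


def nodes_needed(raw_tb: float, disk_tb_per_node: float, replication: int = 3) -> int:
    """Nodes required to hold raw_tb of data when each block is stored `replication` times."""
    return math.ceil(raw_tb * replication / disk_tb_per_node)


print(nodes_needed(100, 24))   # 100 TB of data on 24 TB nodes
print(nodes_needed(200, 24))   # twice the data: add nodes, not bigger machines
```

When the data doubles, the answer is simply more of the same commodity nodes rather than a single more powerful and more expensive server.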

Hadoop can handle unstructured/semi-structured data

Hadoop doesn’t enforce a schema on the data it stores. It can handle arbitrary text and binary data. So Hadoop can digest any unstructured data easily.
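
Not enforcing a schema at write time is often called “schema on read”: raw data is stored as-is, and structure is applied only when someone queries it. The sketch below illustrates the idea in plain Python. It does not use any Hadoop API, and the sample lines and the `read_with_schema` helper are invented for the example.

```python
import json

# Raw lines are stored exactly as they arrive, whatever their shape.
raw_store = [
    '{"user": "asha", "action": "login"}',
    'plain text line with no structure',
    '{"user": "ravi", "action": "purchase", "amount": 42}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time, skipping lines that do not fit it."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                      # not JSON at all: ignore for this query
        if isinstance(record, dict) and all(f in record for f in fields):
            yield {f: record[f] for f in fields}


logins = list(read_with_schema(raw_store, ["user", "action"]))
purchases = list(read_with_schema(raw_store, ["user", "amount"]))

print(logins)
print(purchases)
```

The same stored lines answer two different questions with two different schemas, and nothing had to be rejected or reshaped when the data was first written.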

Hadoop clusters provide storage and computing

Keeping storage and processing in separate clusters is not the best fit for big data, because the data has to be moved to the computation. Hadoop clusters, however, provide storage and distributed computing all in one.
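
Hadoop’s original processing model, MapReduce, runs the computation on the nodes that already hold the data. The classic word-count example can be imitated in plain Python to show the three steps (map, shuffle, reduce). This is a single-process sketch of the programming model, not code that runs on a Hadoop cluster, and the two text chunks stand in for blocks stored on different nodes.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list:
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs) -> dict:
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: dict) -> dict:
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


chunks = ["big data is big", "data is everywhere"]   # pretend each is on its own node
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))

print(counts)
```

In a real cluster the map step runs in parallel on each node’s local blocks, and only the much smaller intermediate pairs travel over the network to the reducers.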

Big Data Tools

Data Storage and Management Tools

  • MongoDB
  • Cassandra
  • Neo4j
  • Apache Hadoop
  • Apache HBase
  • Microsoft HDInsight
  • Apache ZooKeeper

Data Cleaning

  • Microsoft Excel
  • OpenRefine

Data Mining

  • Teradata
  • RapidMiner

Data Visualization

  • Tableau
  • IBM Watson Analytics
  • Plotly

Data Reporting

  • Power BI

Data Ingestion

  • Sqoop
  • Flume
  • Apache Storm

Data Analysis

  • Apache Hive
  • Apache Pig
  • Apache Hadoop MapReduce
  • Apache Spark

Distributed Computing

  • Apache YARN

Hadoop Case Studies in the Enterprise

1. BT

BT uses a Cloudera enterprise data hub powered by Apache Hadoop to cut down on engineer call-outs. By analyzing the characteristics of its network, BT can identify whether slow internet speeds are caused by a network or customer issue. It can then evaluate whether an engineer would be likely to repair the problem. The Cloudera hub provides a unified view of customer data stored in a Hadoop environment. BT earned a return on investment of between 200 and 250 percent within one year of the deployment. BT has also used it to create new services such as “View My Engineer”, an SMS and email alerting system that lets customers track the location of engineers. The company now wants to use predictive analytics to improve vehicle maintenance.

2. CERN

The Large Hadron Collider in Switzerland is one of the largest and most powerful machines in the world. It is equipped with around 150 million sensors, producing a petabyte of data every second, and the data being delivered is growing all the time.
CERN researcher Manuel Martin Marquez said: “This data has been scaling in terms of amount and complexity, and the role we have is to serve to these scaleable requirements, so we run a Hadoop cluster.”
“From a simplistic manner, we run particles through machines and make them collide, and then we store and analyze that data.”
“By using Hadoop we limit the cost in hardware and complexity in maintenance.”

3. Tesla

Tesla is using a Hadoop cluster to collect the increasing amount of data being generated by its connected cars.

CIO Jay Vijayan said: “We are working on a big data platform… The car is connected, but it does not really talk to the network every minute because we want to keep it as smart and efficient as possible. It alerts us if the car is not functioning properly so service teams can take action.”

4. British Airways

British Airways deployed its first instance of Hadoop in April 2015, as a data archive for legal cases that were primarily stored, at a high cost, on its enterprise data warehouse (EDW) platform. British Airways’ data exploitation manager Alan Spanos said that since deploying Hortonworks HDP 2.2, his department has achieved a return on its investment within a year and is able to deliver 75 percent more free space for new projects, which translates into cost reductions for the airline’s finance team. Spanos said: “In business intelligence, if you don’t adopt this technology to do at least part of your job role, you will not exist in a few years’ time. You can only go so far with traditional technology. It still has a place within your architecture, but quite frankly, this is where you need to be.”

5. Royal Bank Of Scotland

Royal Bank of Scotland (RBS) has been working with Silicon Valley company Trifacta to get its Hadoop data lake in order, so it can gain insight from the chat conversations its customers are having with the bank online. RBS stores approximately 250,000 chat logs plus associated metadata per month. The bank stores this unstructured data in Hadoop. However, before turning to Trifacta this was a huge and untapped source of information about its user base.

6. McKinsey

McKinsey projected that efficient usage of Big Data and Hadoop in the healthcare industry can reduce data warehousing expenses by $300-$500 billion globally. The data generated by electronic health devices is difficult to analyse using traditional database management systems. The complexity and volume of healthcare data are the primary driving forces behind the transition from legacy systems to Hadoop in the healthcare industry. Using Hadoop on data of this scale helps with easy and quick data representation, database design, clinical decision analytics, data querying, and fault tolerance.

7. Nokia

Another example of Big Data management in the telecom industry comes from Nokia. They store and analyse a massive volume of data from the mobile phones they manufacture. To paint a fair picture of Nokia’s Big Data, they manage 100 TB of structured data along with 500+ TB of semi-structured data. The Hadoop Distributed File System (HDFS), provided by Cloudera, manages every variety of Nokia’s data and processes it at petabyte scale.

8. Morgan Stanley and JPMorgan Chase & Co.

Morgan Stanley, with assets over 350 billion, is one of the world’s biggest financial services organizations. It relies on the Hadoop framework to make industry-critical investment decisions. Hadoop provides scalability and better results, and it can manage petabytes of data, which is not possible with traditional database systems.

JPMorgan Chase is another financial giant, providing services in more than 100 countries. Such large commercial banks can leverage big data analytics more effectively by using frameworks like Hadoop on massive volumes of structured and unstructured data. JPMorgan Chase has mentioned on various channels that it prefers to use HDFS to support the exponentially growing size of its data as well as for low-latency processing of complex unstructured data.

9. The Weather Channel

TWC is above all a technology platform operator. It has developed an extremely high-volume data platform that collects and analyzes data from 3 billion weather forecast reference points, more than 40 million smartphones, and 50,000 airplane flights per day, and serves 65 billion unique accesses to weather data each day.

TWC has found a big business in big data. It collects terabytes of data every day and uses it not only to predict the weather in millions of locations but also to predict what consumers in those locations will buy. TWC married more than 75 years’ worth of weather data with aggregated consumer purchasing data. For example, air-conditioner sales increase during hot weather, but folks in Atlanta suffer three days longer than people in Chicago before running out to buy one. Such analysis has created a whole new business for TWC: selling ads based on big data analytics.

Final Thoughts…

Big Data is the name for all the problems that arise from the collection, storage, retrieval, and many other operations on data.
Follow me on Medium for more Big Data related articles and connect with me on LinkedIn.
