Big Data

Brief description and definition of Big Data, tools to use it, and the future of it.

Transcript

  • 1. Big Data Taufiq Hail Ghilan Al-Madhagy 6/7/2013
  • 2. 1. Introduction

The world is changing, and this is the digital era. Almost everything around us is digitized, and information flows in from a huge variety of sources: mobile phones, smart devices, surveillance systems, astronomical and weather-forecasting sensors, medical equipment, customer transactions on the internet, user behavior on the web, and so on. This creates enormous amounts of data, from terabytes to petabytes, accumulating through daily and weekly transactions. This data is called "Big Data", and it has provoked new research in information analysis, structuring, and visualization. One of the most successful methods for gaining insight into Big Data is Hadoop, which the pioneers of database management systems are adopting, with added tools, to extract valuable information from the data, understand it better, and consequently take proper actions and decisions based on that understanding. In the following essay we dig further into the definition of Big Data, its types and benefits, the challenges surrounding it, and the techniques used so far to address those challenges. Big Data is now the talk not of the town but of the IT market and of scientists; covering it needs not a few pages but a PhD research program, as the problem grows day by day in the digitized era while ever more digital devices infiltrate our daily lives.

2. Types of Big Data

Big Data has varying definitions. Some define it as the greater volume of today's data, the new types of data and analysis, or the emerging requirement for more real-time information analysis (IBM Executive Report [2]). Others argue that "big" is not a fixed amount of data that can be pinned down: what is big today may be small tomorrow. Most researchers nevertheless agree that amounts of data between terabytes and petabytes count as Big Data, although this threshold may grow over time. Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques [1]. Big Data comes in a variety of types, classified as structured data, unstructured data, text, and multimedia [2]; its sources include social media, web and software logs, camera pictures and other logged information, information-sensing mobile devices, aerial sensory technologies, genomics, and medical records [6].
  • 3. 3. Benefits of Big Data

Many companies look to Big Data as a source for better understanding, and better prediction, of customer behavior, and thus for improving the customer experience. Social media, transactions from banks and other sources, syndicated data such as loyalty-card records, and other customer-related information give companies valuable input for predicting customers' preferences and needs; in other words, for building long-term businesses and customer relationships that can last for decades. With this understanding, organizations of all types are finding new ways to connect with existing and potential customers. The approach applies to small businesses and to enterprises in telecommunications, healthcare, government, and banking, as well as to business-to-business interactions among partners and suppliers. The benefits of Big Data include customer-centric objectives and many functional objectives already being addressed through early applications: operational optimization, risk and financial management, employee collaboration, and the enabling of new business models, benefits for both the customer and the producer. A report released in May 2011 by McKinsey says that leading companies are using big data analytics to gain competitive advantage, and forecasts a 60% margin increase for retail companies able to harness the power of Big Data, as quoted in a document from Oracle [6]. Those companies perceived the importance of these huge amounts of data and realized that now is the time to take advantage of them [6].

4. Challenges in Big Data

Doug Laney was the first to describe Big Data with the "3 Vs": volume, velocity, and variety. IBM added a fourth V, veracity; its inclusion emphasizes the importance of addressing and managing the uncertainty inherent in some types of data [2]. I should add that some researchers call the three Vs the "3 Ss": source (variety), speed (velocity), and size (volume). The following paragraphs describe these challenges.

1. Volume

The huge amount of data, ranging between terabytes and petabytes, is the main challenge facing Big Data. Volume refers to the mass quantities of data that organizations are trying to exploit to improve decision-making across the enterprise.
  • 4. Data volumes continue to increase at an unprecedented rate [2], and traditional hardware and relational database processing are incapable of handling many of the tasks Big Data demands: modeling the earth's climate, predicting the weather, receiving and analyzing the huge amounts of patient data collected in hospitals, diagnosing diseases, gathering information from the galaxy, and so on.

2. Velocity

The amount of data flowing into any enterprise every day is increasing exponentially, beyond what traditional systems can store and process. The speed at which data is created, processed, and analyzed also keeps rising, so data is always in motion, from its creation through processing to storage and retrieval [2]. Data streaming has become essential to almost every internet activity, even on mobile devices such as phones and tablets. Data is now generated continuously at a pace that traditional systems cannot capture, store, and analyze. Online video, location tracking with GPS, and augmented reality are among the many applications that depend on large amounts of fast-streaming data [1]. Delivering such services has become a challenge for organizations, which need new methods where the conventional ones no longer fit. For time-sensitive processes such as real-time fraud detection or multi-channel "instant" marketing, certain types of data must be analyzed in real time to feed the business decisions that let a business improve. Wherever there is velocity we should also speak of latency: the delay between the moment data is created and the moment it is accessed and analyzed [2].

3. Variety

Variety refers simply to the different types and sources of data. The data stored and processed every day comes in many forms. In the past, the data to be processed consisted of personal documents, financial transactions, stock records, and so on; today we also have audio, video, graphics, 3D models, location data, and many other complex data types that need to be stored, delivered, and processed. Such unstructured Big Data is not easy to categorize with the traditional methods for handling large amounts of data, and in reality it is messy and needs cleansing before any analysis can be applied [1]. Variety is about managing the complexity of multiple data types: structured, semi-structured, and unstructured. Organizations need to integrate and analyze data from both traditional and non-traditional information sources, from inside and outside the enterprise. With the spread of sensors, smartphones, and social collaboration technologies, data is generated in a variety of forms, including text, web data, tweets, sensor data, audio, video, click streams, log files, and more, as discussed in the IBM report [2].
  • 5. 4. Veracity

Veracity refers to data uncertainty: the level of reliability associated with certain types of data. Quality is one of the critical requirements of Big Data, yet no available tool can purge some kinds of data of their inherent unpredictability; weather, finance, and customers' buying intentions are examples [2]. Many organizations sit on huge piles of data whose analysis even their own managers cannot fully trust, and managers must understand this uncertainty if they are to take proper decisions in a continually changing environment. Opportunities to use big data technology and analytics to improve decision-making and performance exist in every industry, and managers should be aware of these capabilities. Take, as an example of Big Data uncertainty, generating energy from natural resources: the amount of data collected about wind is huge, yet we still cannot assemble the full picture precisely, because we cannot predict the behavior of the weather, the winds, and the clouds. Even so, the data can be valuable and useful as a basis for decisions about future power production. So how do you plan when all these uncertainties are in place? Analysts point to data fusion, combining multiple less reliable sources into a more useful data point, for example social comments appended to geospatial location information. Another way to manage uncertainty is advanced mathematics, such as fuzzy logic and robust optimization techniques.

5. Techniques and Approaches to Overcome the Challenges

The three Vs, volume, velocity, and variety, are the main challenges of Big Data, and overcoming them requires new technology beyond the traditional methods of today's relational database systems. One approach is the Hadoop project, open source from Apache, developed as a set of software libraries that provide reliability, scalability, and distributed computing, and able to handle Big Data processing and analytics. It is worth mentioning that Hadoop is used at large scale by most Big Data pioneers, such as LinkedIn, which generates over 100 billion personalized recommendations every week [1], and Twitter. To explain the mechanism simply: the large data set is fragmented into smaller sets, which are scattered across a cluster of servers that carry out the computation using a simple programming model (see the sketch that follows). The number of servers may range from a few hundred to around two thousand, or maybe more.
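To make the fragmentation step concrete, here is a minimal Java sketch of loading a file into the Hadoop Distributed File System, which transparently splits it into blocks and replicates them across the cluster's data nodes. The name-node address, paths, and file names are hypothetical, chosen only for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadIntoHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical name-node address; a real cluster reads this from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);
        // HDFS splits the file into fixed-size blocks and replicates each block
        // across several data nodes, so the later computation can run in parallel.
        fs.copyFromLocalFile(new Path("/local/data/transactions.log"),
                             new Path("/bigdata/transactions.log"));
        fs.close();
    }
}
```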
  • 6. What is new about this computation method is that Hadoop detects and compensates for hardware failures at the application level, whereas the traditional method depends on expensive, highly reliable servers. This guarantees continuity of the delivered services if any server in any of the clusters fails, and it distributes the computing over the mass of data among the servers in a low-cost, effective way [1]. The two key elements of Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce. The first provides the high-bandwidth, cluster-based storage that Big Data processing needs; the second is the data-processing framework. MapReduce, based on Google's search technology, maps large data sets across the cluster's servers: the overall data set is processed in parts, each server does its part and produces a summary, and all the summaries are aggregated in the "reduce" stage. In this way the data is pre-processed before traditional data-analysis tools are applied [3]; the word-count example below illustrates the idea. Walking through the technical side a little: Hadoop consists of two layers, HDFS and MapReduce. The lower layer contains the name node, which stores the metadata, the information about the smaller pieces of actual data that are processed in the data nodes. The upper layer contains the job tracker, which decides which piece of data will run and where, and the task tracker, which runs the code [4].

Figure 1: Hadoop's two layers, HDFS (name node and data nodes) and MapReduce (job tracker and task tracker).
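As an illustration of the map-and-summarize mechanism just described, the following is the canonical word-count job written against Hadoop's Java MapReduce API: each mapper emits a partial count for the input fragment it sees, and the reducer aggregates those partial summaries into final totals. This is a standard textbook sketch, not code from the presentation; the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map stage: runs on the server holding each fragment of the input.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // emit a partial "summary" for this fragment
      }
    }
  }

  // Reduce stage: aggregates the partial counts produced by all the mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```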
  • 7. Let us look at the differences between the conventional way of processing data and the way Hadoop's MapReduce does it. The following table compares the two in terms of access, updates, structure, integrity, and scaling. Notice that in MapReduce the data is always moving and dynamic, updates are discouraged (data is written once and read many times), and the data can scale to far higher volumes.

Figure 2: Conventional data processing versus MapReduce, compared by access, updates, structure, integrity, and scaling.

Microsoft has adopted Hadoop, with some modifications to give it an easy, user-friendly interface and with added connectors, to make a Microsoft-like product. Some of the tools Microsoft uses to deal with Big Data are Power View, PowerPivot in Excel, and SharePoint [5].
  • 8. These are some of the tools usually used with BI to gain insight into structured data. In addition, Microsoft is implementing Hadoop on Windows Azure and Windows Server. It has created JavaScript libraries and a framework for Hadoop and has entered a partnership with Hortonworks. Moreover, Microsoft provides ODBC drivers and a Hive add-in for Excel to work with Big Data; the ODBC drivers enable third-party applications to integrate with Hadoop on Windows systems [4]. The following illustration shows the solution Microsoft is providing for Big Data.

Figure 3: Microsoft's Big Data solution (source: Microsoft [5]).

As the diagram shows, the data may be structured (ERP, CRM, LOB apps) or unstructured from different sources (sensors, devices, bots, crawlers). Structured data is stored in the enterprise data warehouse; otherwise the data is moved to the upper layer to be processed with Hadoop on a Windows platform, Windows Server or Azure. It is then processed with SQL Server Analysis Services or SQL Server Reporting Services on the business intelligence platform, to analyze and gain insight into all this mixed, huge data. Finally, the output is visualized with Excel PowerPivot, Power View, predictive analytics tools, or embedded BI tools, all Microsoft tools the user is already familiar with. A sketch of this kind of third-party integration follows.
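The slide describes ODBC connectivity for Excel; as a hedged illustration of the same idea, programmatic access to Hadoop data through Hive, here is a minimal Java sketch using Hive's JDBC driver, the JDBC counterpart of the ODBC route above. The gateway host, credentials, and the web_logs table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveIntegrationSketch {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical gateway host and database; credentials depend on the cluster.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hadoop-gateway:10000/default", "analyst", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this SQL-like query into MapReduce jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```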
  • 9. Oracle is also among the pioneers developing methods to address Big Data. It has developed Oracle Big Data Connectors, Oracle Loader for Hadoop, and Oracle Data Integrator [6], and statistical and analysis capabilities such as the open-source R project and Oracle R Enterprise have been developed to take advantage of Hadoop. Oracle also looks at traditional data and has created tools to make it easier to understand and gain insight from; the following figure shows traditional data from Oracle's perspective.

Figure 4: Traditional data from Oracle's perspective.

Oracle has added new mechanisms that use Hadoop technology together with its proprietary analytics and BI tools to deal with Big Data, as the next figure shows.

Figure 5: Oracle's architecture for Big Data, combining Hadoop with its analytics and BI tools.

Many Big Data pioneers deploy the old and the new in parallel, running Hadoop alongside traditional processing, though it is also expected that Hadoop will replace other data-processing methods and become the dominant solution for Big Data. Big Data will progress as artificial intelligence advances and as new kinds of processing power become available, such as quantum computing, which uses quantum-mechanical states and is theoretically expected to excel at the parallel processing of unstructured data [3]. Other technologies used with big data include massively parallel-processing (MPP) databases, search-based applications, data-mining grids, distributed file systems, distributed databases, and cloud-based infrastructure. Almost none of these technologies is new, but their use with Big Data has been enhanced.
  • 10. Big Data requires high-speed transactions, analysis, and retrieval of data, so it needs high-capacity hard drives such as SATA drives and/or high-speed storage such as solid-state disks (SSDs), which are memory-based. These storage systems sit inside the parallel-processing nodes used with Big Data.

6. Conclusion

In the past decade, information has become a dominant factor in our daily life. Everything surrounding us is digitized, and the data keeps growing and moving all the time. These huge amounts of continuously moving, changing data have become unpredictable and hard to understand, because they are not organized in a way we can benefit from. In the past, companies were mainly interested in capturing whatever information they could about customer behavior, taking as much data as medical equipment could produce, or collecting as much information from the galaxy as sensors could gather; now we face the question, "what are we going to do with all these piles of data?" We have reached an era in which companies need to exploit all this data so that its insight and value can be extracted. Hadoop, whose algorithm Google began building in 2004 and which is open source, is now the major player. The pioneers of database processing, such as Microsoft, Oracle, and IBM, are in a fast race to adopt it and to integrate it with their proprietary analytical and BI tools. The race continues, and at the core of all this development is parallel processing with Hadoop. We do not yet know what the future holds for Big Data: will it be addressed with new processing algorithms, or with new hardware built on the latest technologies? Will it move to the front of the queue, ahead of cloud computing? Will artificial intelligence play a role in developing dynamic algorithms that can cope with fast-moving, dynamic data? The question is wide open, and the future is expected to bring us more. What is important is that the key information-architecture principles stay the same, while the tactics for applying them differ from one company to another. We should look at Big Data as an asset that will bring us a better future if we truly gain insight from it.

7. References

[1] ExplainingComputers.com. (n.d.). Big data. Retrieved June 2, 2013, from http://www.explainingcomputers.com/big_data.html

[2] IBM. (2013). Retrieved June 4, 2013, from http://public.dhe.ibm.com/common/ssi/ecm/en/gbe03519usen/GBE03519USEN.PDF

[3] Wikipedia. (2013). Big data. Retrieved May 29, 2013, from http://en.wikipedia.org/wiki/Big_data
  • 11. [4] YouTube. (n.d.). Retrieved May 29, 2013, from https://www.youtube.com/watch?v=HM0YX7mpplk

[5] Microsoft. (n.d.). Microsoft Big Data Solution Brief. Retrieved from http://download.microsoft.com/download/F/A/1/FA126D6D-841B-4565-BB26-D2ADD4A28F24/Microsoft_Big_Data_Solution_Brief.pdf

[6] Oracle. (n.d.). Retrieved June 4, 2013, from http://www.oracle.com/technetwork/topics/entarch/articles/oea-big-data-guide-1522052.pdf