Big Data
Hadoop
Tomy Rhymond | Sr. Consultant | HMB Inc. | ttr@hmbnet.com | 614.432.9492
Big data with Hadoop - Introduction
Big Data with Hadoop and HDInsight: an introduction to the technology. If you are new to Big Data or have only just heard of it, this presentation will help you learn a little more about it.
  1. 1. Big Data Hadoop Tomy Rhymond | Sr. Consultant | HMB Inc. | ttr@hmbnet.com | 614.432.9492 “Torture the data, and it will confess to anything.” – Ronald Coase, economist and Nobel laureate. "The goal is to turn data into information, and information into insight." – Carly Fiorina. "Data are becoming the new raw material of business." – Craig Mundie, Senior Advisor to the CEO at Microsoft. “In God we trust. All others must bring data.” – W. Edwards Deming, statistician, professor, author, lecturer, and consultant.
  2. 2. Agenda • Data • Big Data • Hadoop • Microsoft Azure HDInsight • Hadoop Use Cases • Demo • Configure Hadoop Cluster / Azure Storage • C# MapReduce • Load and Analyze with Hive • Use Pig Script for Analyze data • Excel Power Query
  3. 3. Houston, we have a “Data” problem. • An IDC estimate puts the size of the “digital universe” at 40 zettabytes (ZB) by 2020, a 50-fold growth from the beginning of 2010. • By 2020, emerging markets will supplant the developed world as the main producer of the world’s data. • This flood of data is coming from many sources: • The New York Stock Exchange generates about 1 terabyte of trade data per day • Facebook hosts approximately one petabyte of storage • The Large Hadron Collider produces about 15 petabytes of data per year • The Internet Archive stores around 2 petabytes of data, growing at a rate of 20 terabytes per month • Mobile devices and social networks contribute to the exponential growth of data.
  4. 4. 85% Unstructured, 15% Structured • Data as we have traditionally known it is structured. • Structured data refers to information with a high degree of organization, such that its inclusion in a relational database is seamless and it is readily searchable. • Not all the data we collect conforms to a specific, pre-defined data model. • Unstructured data tends to be human-generated, people-oriented content that does not fit neatly into database tables. • An estimated 85 percent of business-relevant information originates in unstructured form, primarily text. • The lack of structure makes compilation a time- and energy-consuming task. • These data sets are so large and complex that they are difficult to process with on-hand management tools or traditional data processing applications. • This type of data is being generated by everything around us at all times: every digital process and social media exchange produces it, and systems, sensors and mobile devices transmit it.
  5. 5. Data Types • Relational Data – SQL data • Semi-Structured Data – JSON • Un-Structured Data – Twitter feeds, Amazon reviews
  6. 6. So What is Big Data? • Big Data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. • It means capturing and managing lots of information and working with many new types of data. • It means exploiting these masses of information and new data types in applications to extract meaningful value from big data. • It is the process of applying serious computing to seriously massive and often highly complex sets of information. • Big data arrives from multiple sources at an alarming velocity, volume and variety. • More data leads to more accurate analyses, and more accurate analyses may lead to more confident decision making.
  7. 7. The 4 V’s of Big Data
  • Volume (scale of data): We currently see exponential growth in data storage, as data is now much more than text; there are videos, music and large images on our social media channels. Terabyte- and petabyte-scale storage systems are now common in enterprises. (IDC estimates 40 zettabytes of data by 2020; the NY Stock Exchange generates about 1 terabyte of trade data per day; an estimated 2.5 quintillion bytes of data are created each day.)
  • Velocity (analysis of streaming data): Velocity describes the frequency at which data is generated, captured and shared. Recent developments mean that not only consumers but also businesses generate more data in much shorter cycles. (Roughly 500 million tweets per day; 13 hours of video uploaded per minute; an estimated 20 billion network connections by 2016.)
  • Variety (different forms of data): Today’s data no longer fits into neat, easy-to-consume structures. New types include content, geo-spatial data, hardware data points, location-based data, log data, machine data, metrics, mobile data, physical data points, process data, RFID and more.
  • Veracity (uncertainty of data): This refers to the uncertainty of the available data. Veracity isn’t just about data quality; it’s about data understandability, and it affects confidence in the data. (Poor data quality is estimated to cost businesses $600 billion a year; 30% of data collected by marketers is not usable for real-time decision making; one in three leaders doesn’t trust the information they use to make decisions.)
  8. 8. Big Data vs Traditional Data
  • Data Size: Gigabytes (traditional) vs. Petabytes (big data)
  • Access: Interactive and batch vs. Batch only
  • Updates: Read and write many times vs. Write once, read many times
  • Structure: Static schema vs. Dynamic schema
  • Integrity: High vs. Low
  • Scaling: Nonlinear vs. Linear
  9. 9. Data Storage • The storage capacity of hard drives has increased massively over the years. • On the other hand, the access speeds of the drives have not kept up. • A drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data in about 5 minutes. • Today, one-terabyte drives are the norm, but the transfer rate is around 100 MB/s, so it takes more than two and a half hours to read all the data, and writing is even slower. • The obvious way to reduce the time is to read from multiple disks at once: with 100 disks each holding one hundredth of the data and working in parallel, we could read all the data in under 2 minutes. • Move computing to the data rather than bringing the data to the computing.
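The arithmetic on this slide can be checked with a few lines of Python. The figures are the slide's illustrative numbers, not benchmarks:

```python
def read_time_minutes(capacity_mb: float, rate_mb_per_s: float) -> float:
    """Time to scan an entire drive sequentially at a sustained transfer rate."""
    return capacity_mb / rate_mb_per_s / 60

# 1990s drive: 1,370 MB at 4.4 MB/s -> about 5 minutes
old = read_time_minutes(1370, 4.4)

# Modern drive: 1 TB at 100 MB/s -> more than two and a half hours
new = read_time_minutes(1_000_000, 100)

# 100 disks, each holding one hundredth of the data, read in parallel
parallel = read_time_minutes(1_000_000 / 100, 100)

print(f"1990 drive:       {old:.1f} min")
print(f"1 TB drive:       {new / 60:.1f} h")
print(f"100 in parallel:  {parallel:.1f} min")
```

This is exactly the observation that motivates Hadoop: aggregate disk bandwidth scales with the number of machines, so the framework ships the computation to where the blocks live.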
  10. 10. Why big data should matter to you • The real issue is not that you are acquiring large amounts of data. It's what you do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable • cost reductions • time reductions • new product development and optimized offerings • smarter business decision making. • By combining big data and high-powered analytics, it is possible to: • Determine root causes of failures, issues and defects in near-real time, potentially saving billions of dollars annually. • Send tailored recommendations to mobile devices while customers are in the right area to take advantage of offers. • Quickly identify customers who matter the most. • Generate retail coupons at the point of sale based on the customer's current and past purchases.
  11. 11. OK, I’ve got Big Data. Now what? • The huge influx of data raises many challenges. • Data analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information. • To analyze and extract meaningful value from these massive amounts of data, we need optimal processing power. • We need parallel processing, and therefore many pieces of hardware. • When we use many pieces of hardware, the chance that one will fail is fairly high. • A common way of avoiding data loss is replication: redundant copies of the data are kept. • Data analysis tasks need to combine data; the data from one disk may need to be combined with data from 99 other disks.
  12. 12. Challenges of Big Data • Information Growth • Over 80% of the data in the enterprise consists of unstructured data, growing at a much faster pace than traditional data • Processing Power • The approach of using a single, expensive, powerful computer to crunch information doesn’t scale for Big Data • Physical Storage • Capturing and managing all this information can consume enormous resources • Data Issues • Lack of data mobility, proprietary formats and interoperability obstacles can make working with Big Data complicated • Costs • Extract, transform and load (ETL) processes for Big Data can be expensive and time consuming
  13. 13. Hadoop • Apache™ Hadoop® is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. • It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. • All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. • Hadoop consists of the Hadoop Common package, which provides filesystem and OS level abstractions, a MapReduce engine and the Hadoop Distributed File System (HDFS).
  14. 14. History of Hadoop • Hadoop is not an acronym; it’s a made-up name. • Named after a stuffed yellow elephant belonging to the son of Doug Cutting, the project’s creator. • 2002-2004: Nutch Project - a web-scale, open source, crawler-based search engine. • 2003-2004: Google published GFS (Google File System) & MapReduce • 2005-2006: GFS & MapReduce implementations were added to Nutch • 2006-2008: Yahoo hired Doug Cutting and his team. They spun out the storage and processing parts of Nutch to form Hadoop. • 2009: Sorted 500 GB in 59 seconds (on 1,400 nodes) and 100 TB in 173 minutes (on 3,400 nodes)
  15. 15. Hadoop Modules • Hadoop Common: The common utilities that support the other Hadoop modules. • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. • Hadoop YARN: A framework for job scheduling and cluster resource management. • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. • Other related modules: • Cassandra - A scalable multi-master database with no single point of failure. • HBase - A scalable, distributed database that supports structured data storage for large tables. • Pig - A high-level data-flow language and execution framework for parallel computation. • Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying. • ZooKeeper - A high-performance coordination service for distributed applications.
  16. 16. HDFS – Hadoop Distributed File System • The heart of Hadoop is HDFS. • The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. • HDFS is designed around the following assumptions and goals: • Hardware failure is the norm rather than the exception. • HDFS is designed more for batch processing than for interactive use by users. • Applications that run on HDFS have large data sets; a typical file in HDFS is gigabytes to terabytes in size. • HDFS applications use a write-once-read-many access model: a file, once created, written and closed, need not be changed. • A computation requested by an application is much more efficient if it is executed near the data it operates on. In other words, moving computation is cheaper than moving data. • HDFS is easily portable from one platform to another.
  17. 17. Hadoop Architecture • NameNode: • The NameNode is the node that stores the filesystem metadata, i.e. which file maps to which block locations and which blocks are stored on which DataNode. • Secondary NameNode: • The NameNode is a single point of failure; the Secondary NameNode periodically merges the NameNode’s namespace image with its edit log to produce checkpoints. It is not a hot standby. • DataNode: • The DataNode is where the actual data resides. • All DataNodes send a heartbeat message to the NameNode every 3 seconds to say that they are alive. • DataNodes can talk to each other to rebalance data, move and copy data around, and keep replication high. • Job Tracker/Task Tracker: • The primary function of the JobTracker is resource management (managing the TaskTrackers), tracking resource availability, and task life cycle management (tracking progress, fault tolerance, etc.). • The TaskTracker has the simple job of following the orders of the JobTracker and periodically updating it with its progress status. (Diagram: a NameNode and Secondary NameNode managing metadata, with replicated blocks B1-B6 spread across DataNodes on two racks, and clients performing reads and metadata operations.)
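As a rough sketch of the bookkeeping the NameNode does, the toy Python below splits a file into blocks and assigns each block to three DataNodes. This is illustrative only: real HDFS placement is rack-aware, and the round-robin policy, node names, and classic 64 MB block size here are assumptions for the example:

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # classic HDFS default block size
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks the NameNode records for a file of this size."""
    return (file_size + block_size - 1) // block_size  # ceiling division

def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: each block -> `replication` distinct DataNodes.
    Real HDFS is rack-aware; this only illustrates the block-to-node metadata."""
    placement = {}
    ring = itertools.cycle(range(len(datanodes)))
    for b in range(num_blocks):
        start = next(ring)
        placement[b] = [datanodes[(start + i) % len(datanodes)]
                        for i in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file -> 4 blocks
print(place_blocks(blocks, nodes))
```

Losing any one DataNode in this picture loses at most one of three replicas of each affected block, which is why the heartbeat-plus-re-replication scheme above tolerates routine hardware failure.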
  18. 18. HDFS - InputSplit • InputFormat • Split the input blocks and files into logical chunks of type InputSplit, each of which is assigned to a map task for processing. • RecordReader • A RecordReader uses the data within the boundaries created by the input split to generate key/value pairs.
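A minimal sketch of what a line-oriented RecordReader does with an InputSplit, assuming TextInputFormat-style semantics: a split that starts mid-line skips that partial line (the previous split's reader finishes it), and the reader runs past the split boundary to complete its final record. The function name and sample data are made up for illustration:

```python
def line_records(data: bytes, split_start: int, split_len: int):
    """Toy line RecordReader: yields (byte_offset, line) pairs for one split."""
    pos = split_start
    if pos != 0:
        # Skip the partial line that belongs to the previous split's reader.
        nl = data.find(b"\n", pos - 1)
        pos = len(data) if nl == -1 else nl + 1
    end = split_start + split_len
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = len(data) if nl == -1 else nl
        yield pos, data[pos:line_end].decode()
        pos = line_end + 1

data = b"alpha\nbravo\ncharlie\ndelta\n"
# Two 13-byte splits over the same file: every line appears in exactly one split,
# even though the byte boundary falls in the middle of "charlie".
print(list(line_records(data, 0, 13)))
print(list(line_records(data, 13, 13)))
```

The point of the two rules is exactly what the slide describes: splits are logical chunks over physical blocks, so record boundaries never have to line up with block boundaries.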
  19. 19. MapReduce • Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. • It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. • The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. • The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). • The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. (Diagram: Big Data -> Map -> Reduce -> Result.)
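The two tasks are easiest to see in the classic word-count example. This is a single-process Python simulation of the map, shuffle, and reduce phases, not a real Hadoop job; in the demo later, the same mapper/reducer pair is written in C# against HDInsight:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line: str):
    """Map: break each input line into (word, 1) key/value pairs."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: the framework groups map output by key (here, by sorting)."""
    return sorted(pairs, key=itemgetter(0))

def reducer(sorted_pairs):
    """Reduce: combine each key's tuples into a smaller set (word, total)."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ["big data big hadoop", "hadoop big"]
mapped = [kv for line in lines for kv in mapper(line)]
result = dict(reducer(shuffle(mapped)))
print(result)   # {'big': 3, 'data': 1, 'hadoop': 2}
```

In a real cluster, many mappers run in parallel (one per InputSplit) and the shuffle moves each key's pairs across the network to the reducer responsible for it; the program logic, however, is just these two functions.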
  20. 20. Hadoop Distributions • Microsoft Azure HDInsight • IBM InfoSphere BigInsights • Hortonworks • Amazon Elastic MapReduce • Cloudera CDH
  21. 21. Hadoop Meets The Mainframe • BMC • Control-M for Hadoop is an extension of BMC’s larger Control-M product suite that was born in 1987 as an automated mainframe job scheduler. • Compuware • APM is an application performance management suite that also spans the arc of enterprise data center computing, from the mainframe to distributed commodity servers. • Syncsort • Syncsort offers Hadoop Connectivity to move data between Hadoop and other platforms including the mainframe, Hadoop Sort Acceleration, and Hadoop ETL for cross-platform data integration. • Informatica • HParser can run its data transformation services as a distributed application on Hadoop’s MapReduce engine.
  22. 22. Azure HDInsight • HDInsight makes Apache Hadoop available as a service in the cloud. • Process, analyze, and gain new insights from big data using the power of Apache Hadoop • Drive decisions by analyzing unstructured data with Azure HDInsight, a big data solution powered by Apache Hadoop. • Build and run Hadoop clusters in minutes. • Analyze results with Power Pivot and Power View in Excel. • Choose your language, including Java and .NET. Query and transform data through Hive.
  23. 23. Azure HDInsight Scale elastically on demand Crunch all data – structured, semi- structured, unstructured Develop in your favorite language No hardware to acquire or maintain Connect on- premises Hadoop clusters with the cloud Use Excel to visualize your Hadoop data Includes NoSQL transactional capabilities Azure HDInsight
  24. 24. HDInsight Ecosystem HDFS (Hadoop Distributed File System) MapReduce (Job Scheduling / Execution) Pig (Data Flow) Hive (SQL) Sqoop ETL Tools BI Tools RDBMS
  25. 25. HDInsight • The combination of Azure Storage and HDInsight provides a powerful framework for running MapReduce jobs. • Creating an HDInsight cluster is quick and easy: log in to Azure, select the number of nodes, name the cluster, and set permissions. • The cluster is available on demand, and once a job is completed, the cluster can be deleted but the data remains in Azure Storage. • Use PowerShell to submit MapReduce jobs • Use C# to create MapReduce programs • Supports Pig Latin, Avro, Sqoop and more.
  26. 26. Use cases • A 360-degree view of the customer • Businesses want to know how to utilize social media postings to improve revenue. • Utilities: Predict power consumption • Marketing: Sentiment analysis • Customer service: Call monitoring • Retail and marketing: Mobile data and location-based targeting • Internet of Things (IoT) • Big Data Service Refinery
  27. 27. Demo • Configure HDInsight Cluster • Create Mapper and Reducer Program using Visual Studio C# • Upload Data to Blob Storage using Azure Storage Explorer • Run Hadoop Job • Export output to Power Query for Excel • Hive Example with HDInsight • Pig Script with HDInsight
  28. 28. The 4 V’s of Big Data (recap of slide 7: Volume, Velocity, Variety, Veracity).
  29. 29. Resources for HDInsight for Windows Azure Microsoft: HDInsight • Welcome to Hadoop on Windows Azure - the welcome page for the Developer Preview for the Apache Hadoop-based Services for Windows Azure. • Apache Hadoop-based Services for Windows Azure How To Guide - Hadoop on Windows Azure documentation. • Big Data and Windows Azure - Big Data scenarios that explore what you can build with Windows Azure. Microsoft: Windows and SQL Database • Windows Azure home page - scenarios, free trial sign up, development tools and documentation that you need to get started building applications. • MSDN SQL - MSDN documentation for SQL Database. • Management Portal for SQL Database - a lightweight and easy-to-use database management tool for managing SQL Database in the cloud. • Adventure Works for SQL Database - download page for the SQL Database sample database. Microsoft: Business Intelligence • Microsoft BI PowerPivot - a powerful data mashup and data exploration tool. • SQL Server 2012 Analysis Services - build comprehensive, enterprise-scale analytic solutions that deliver actionable insights. • SQL Server 2012 Reporting - a comprehensive, highly scalable solution that enables real-time decision making across the enterprise. Apache Hadoop: • Apache Hadoop - software library providing a framework that allows for the distributed processing of large data sets across clusters of computers. • HDFS - the Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. • MapReduce - a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. Hortonworks: • Sandbox - a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials.
  30. 30. About Me Tomy Rhymond Sr. Consultant, HMB, Inc. ttr@hmbnet.com http://tomyrhymond.wordpress.com @trhymond 614.432.9492 (m)