Hadoop Technology

1. Hadoop
2. Big Data
   • Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information. (TB, records, transactions, tables, files)
   • Velocity: Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business. (Batch, near-time, real-time, streams)
   • Variety: Big data extends beyond structured data, including semi-structured and unstructured data of all varieties: text, audio, video, click streams, log files, and more. (Structured, unstructured, semi-structured)
   • Verification: With all the big data there will be bad data, and with diverse data there will be more diverse quality and security levels of users. (Good, undefined, bad; inconsistency, incompleteness, ambiguity)
   • Value
3. Big Data – Data Sources
4. Big Data – Data Growth
5. Hadoop Characteristics
   • Open source
   • Distributed data replication
   • Commodity hardware
   • Data and analysis co-location
   • Scalability
   • Reliable error handling
6. Hadoop Storyline
   • 2003: Google published the GFS & MapReduce papers
   • 2006: Apache Hadoop project started for Yahoo requirements
   • 2008: Cloudera founded
   • 2009: First commercial Hadoop distribution released; enterprise support is available
   • 2011: Hortonworks founded
   • 2012: Ecosystem reaches 300 companies
7. Hadoop for Enterprise
8. RDBMS vs. Hadoop
9. RDBMS vs. Hadoop

   | Aspect     | RDBMS                                                                          | Hadoop                                                                    |
   |------------|--------------------------------------------------------------------------------|---------------------------------------------------------------------------|
   | Data size  | Terabytes                                                                      | Petabytes                                                                 |
   | Schema     | Required on write                                                              | Required on read                                                          |
   | Speed      | Reads are fast                                                                 | Writes are fast                                                           |
   | Access     | Interactive and batch                                                          | Batch                                                                     |
   | Updates    | Write and read many times                                                      | Write once, read many times                                               |
   | Scaling    | Scale up                                                                       | Scale out                                                                 |
   | Data types | Structured                                                                     | Multi- and unstructured                                                   |
   | Integrity  | High                                                                           | Low                                                                       |
   | Best use   | Interactive OLAP analytics, complex ACID transactions, operational data store  | Data discovery, processing unstructured data, massive storage/processing  |
10. Benefits of Analysing with Hadoop
   • Previously impossible or impractical analyses become feasible
   • Analysis conducted at lower cost
   • Greater flexibility
11. Big Data & Hadoop in Turkcell
   • Processing «Big Data» since 2009 with Cirrus
   • Hadoop has been in production since December '12
   • ~4.5B records / ~3.5TB of data are processed with Cirrus
   • Data is not stored for future analysis
   • Cloudera Distribution for Hadoop (non-supported)
   • 5 x 24-core machines with SAN storage (not the reference architecture)
12. Common Hadoop-able Problems
   • Modeling True Risk
   • Customer Churn Analysis
   • Recommendation Engine
   • Ad Targeting
   • Point-of-Sale Transaction Analysis
   • Analyzing Network Data to Predict Failure
   • Threat Analysis
   • Search Quality
   • Data ‘Sandbox’
13. Modeling True Risk
14. Modeling True Risk
   • Source, parse, and aggregate disparate data sources to build a comprehensive data picture
     • E.g. credit card records, call recordings, chat sessions, emails, banking activity
   • Structure and analyze
     • Sentiment analysis, graph creation, pattern recognition
   • Typical industry
     • Financial services (banks, insurance)
15. Customer Churn Analysis
16. Customer Churn Analysis
   • Rapidly test and build behavioral models of customers from disparate sources
   • Structure and analyse with Hadoop
     • Traversing
     • Graph creation
     • Pattern recognition
   • Typical industry
     • Telecommunications, financial services
17. Recommendation Engine
18. Recommendation Engine
   • Batch processing framework
     • Allows execution in parallel over large datasets
   • Collaborative filtering
     • Collecting ‘taste’ information from many users
     • Utilizing that information to predict what similar users like
   • Typical industry
     • E-commerce, manufacturing, retail
19. Ad Targeting
20. Ad Targeting
   • Data analysis can be conducted in parallel, reducing processing times from days to hours
   • With Hadoop, as data volumes grow the only expansion cost is hardware
     • Add more nodes without degradation in performance
   • Typical industry
     • Advertising
21. Point of Sale Transaction Analysis
22. Point of Sale Transaction Analysis
   • Batch processing framework
     • Allows execution in parallel over large datasets
   • Pattern recognition
     • Optimizing over multiple data sources
     • Utilizing that information to predict demand
   • Typical industry
     • Retail
23. Analyzing Network Data to Predict Failure
24. Analyzing Network Data to Predict Failure
   • Take the computation to the data
     • Extending the range of indexing techniques from simple scans to more complex data mining
   • Better understand how the network reacts to fluctuations
     • How previously thought discrete anomalies may, in fact, be interconnected
   • Identify leading indicators of component failure
   • Typical industry
     • Utilities, telecommunications, datacenters
25. Threat Analysis
26. Threat Analysis
   • Parallel processing over huge datasets
   • Pattern recognition to identify anomalies, i.e. threats
   • Typical industry
     • Security, financial services, click fraud…
27. Search Quality
28. Search Quality
   • Analysing search attempts in conjunction with structured data
   • Pattern recognition
     • Browsing patterns of users performing searches in different categories
   • Typical industry
     • Web
     • E-commerce
29. Data ‘Sandbox’
30. Data ‘Sandbox’
   • With Hadoop, an organization can dump all of this data into an HDFS cluster
   • Then use Hadoop to start trying out different analyses on the data
   • See patterns or relationships that allow the organization to derive additional value from the data
   • Typical industry
     • Common across all industries
31. Hadoop Core
32. Apache Hadoop Core
   • Hadoop is a distributed storage and processing technology for large-scale applications
   • HDFS: self-healing, distributed file system for multi-structured data; breaks files into blocks and stores them redundantly across the cluster
   • MapReduce: framework for running large data processing jobs in parallel across many nodes and combining the results
33. Master/Slave Model
34. Hadoop Distributed File System
   • The Hadoop Distributed File System (HDFS) stores files across all of the nodes in a Hadoop cluster.
   • It handles breaking the files into large blocks and distributing them across different machines.
   • It also makes multiple copies of each block so that if any one machine fails, no data is lost or unavailable.
35. HDFS – Features
   • Highly fault-tolerant
   • High throughput
   • Suitable for applications with large data sets
   • Streaming access to file system data
   • Can be built out of commodity hardware
36. Hadoop Distributed File System
   • The brain of HDFS is the NameNode.
     • Maintains the master list of files in HDFS
     • Handles mapping of filenames to blocks
     • Knows where each block is stored
     • Ensures each block is replicated the appropriate number of times
   • DataNodes are machines that store HDFS data.
     • Each DataNode is colocated with a TaskTracker to allow moving the computation to the data.
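To make the NameNode/DataNode division concrete, here is a minimal sketch (not from the slides) of a client writing and reading a file through the HDFS Java API; the path and contents are hypothetical, and the Configuration is assumed to pick up the cluster settings from core-site.xml/hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);     // filename/block metadata comes from the NameNode

        Path path = new Path("/tmp/hello.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello hdfs");           // blocks end up replicated across DataNodes
        }
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());     // reads are served by the DataNodes holding the blocks
        }
    }
}
```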
37. HDFS – Design
   • Very large files
   • Streaming data access
     • Time to read the whole file is more important than the time to read the first record
   • Commodity hardware
   • Optimized for high throughput
   • Not a fit for:
     • Low-latency data access
     • Lots of small files
     • Multiple writers, arbitrary file modifications
38. HDFS Architecture
39. MapReduce
   • MapReduce is the framework for running jobs in Hadoop. It provides a simple and powerful paradigm for parallelizing data processing.
   • The JobTracker is the central coordinator of jobs in MapReduce. It controls which jobs are being run, which resources they are assigned, etc.
   • On each node in the cluster there is a TaskTracker that is responsible for running the map or reduce tasks assigned to it by the JobTracker.
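As a concrete illustration of how a job reaches those daemons, here is a minimal WordCount-style driver sketch using the org.apache.hadoop.mapreduce API; the mapper and reducer classes it names are assumptions here, sketched under the map- and reduce-phase slides below.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // sketched under the Map Phase slide
        job.setReducerClass(WordCountReducer.class); // sketched under the Reduce Phase slide
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not yet exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submits and waits
    }
}
```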
40. Hadoop Ecosystem
41. Hadoop Ecosystem
43. YARN
   • The YARN resource manager coordinates the allocation of compute resources on the cluster.
   • The YARN node managers launch and monitor the compute containers on machines in the cluster.
   • The MapReduce application master coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
44. Pig
   • Pig provides an engine for executing data flows in parallel on Hadoop.
   • Pig Latin is a simple-to-understand data flow language used in the analysis of large data sets.
   • Pig scripts are automatically converted into MapReduce jobs by the Pig interpreter.
   • Pig has an optimizer that rearranges some operations in Pig Latin scripts to give better performance and combines MapReduce jobs together.
45. Hive
   • Is a data warehouse system layer built on Hadoop
   • Allows you to define a structure for your unstructured big data
   • Simplifies analysis and queries with an SQL-like scripting language called HiveQL
   • Produces MapReduce jobs in the background
   • Extensible (UDFs, UDAFs, UDTFs)
   • Supports uses such as:
     • Ad hoc queries
     • Summarization
     • Data analysis
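Since Hive is queried much like a database, a minimal sketch of submitting HiveQL over JDBC may help; the HiveServer2 address, credentials, and stocks table are assumptions, though the driver class is Hive's standard one.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", ""); // hypothetical host/user
             Statement stmt = con.createStatement();
             // A summarization query; Hive compiles it to MapReduce jobs in the background.
             ResultSet rs = stmt.executeQuery(
                 "SELECT symbol, AVG(price_close) FROM stocks GROUP BY symbol")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}
```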
46. Hive is not…
   • … a relational database
   • … designed for online transaction processing
   • … suited for real-time queries and row-level updates
47. Stinger for Hive
48. Ambari
   • Ambari for Hadoop clusters:
     • Provision
     • Manage
     • Monitor
49. Ambari
   • Provides a step-by-step wizard for installing Hadoop services across any number of hosts
   • Handles configuration of Hadoop services for the cluster
50. Sqoop and Flume
   • Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
   • Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large quantities of streaming data (e.g. logs) into HDFS. It has a simple and flexible architecture based on streaming data flows.
51. Schemas – HCatalog
   • A table and storage management service for data created using Apache Hadoop
     • Provides a shared schema and data type mechanism
     • Provides a table abstraction so that users need not be concerned with where or how their data is stored
     • Provides interoperability across data processing tools such as Pig, MapReduce, and Hive
   • Example (Pig Latin):
     stocks_daily = load 'nyse_daily' using HCatLoader();
     cleansed = filter stocks_daily by symbol is not null;
52. Mahout
   • The goal of the Apache Mahout™ machine learning library is to build scalable machine learning libraries.
   • Core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
   • The core libraries are highly optimized to give good performance for non-distributed algorithms as well.
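As a taste of the non-distributed side mentioned in the last bullet, here is a minimal sketch of user-based collaborative filtering with Mahout's Taste API; the ratings.csv input (userID,itemID,preference rows) and the neighborhood size are assumptions.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv")); // hypothetical input
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> items = recommender.recommend(1, 3); // top 3 items for user 1
        items.forEach(System.out::println);
    }
}
```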
53. Hadoop Core in Detail
54. Map Phase
   • In the map phase, MapReduce gives the user an opportunity to operate on every record in the data set individually. This phase is commonly used to project out unwanted fields, transform fields, or apply filters.
   • Certain types of joins and grouping can also be done in the map (e.g., joins where the data is already sorted, or hash-based aggregation).
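Continuing the WordCount sketch from the MapReduce slide, a minimal map function: each call sees one input record (here, a line of text), and this is the point where fields would be projected out, transformed, or filtered.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // one (word, 1) pair per token
        }
    }
}
```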
55. Data Locality
56. Combiner Phase
   • Minimizes the data transferred between map and reduce tasks.
   • The combiner gives applications a chance to apply their reducer logic early on.
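In the WordCount sketch, counting is associative and commutative, so the reducer class itself can double as the combiner; a one-line addition to the driver sketched earlier:

```java
job.setCombinerClass(WordCountReducer.class); // pre-aggregates map output before the shuffle
```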
57. Shuffle Phase
   • Data arriving at the reducer has been partitioned and sorted by the map, combine, and shuffle phases.
   • By default, the data is sorted by the partition key. For example, if a data set is partitioned on user ID, in the reducer it will be sorted by user ID as well. Thus, MapReduce uses sorting to group like keys together.
   • It is possible to specify additional sort keys beyond the partition key.
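A hedged excerpt of the Job hooks involved: the three classes named here are hypothetical, but these are the standard setters used to partition by one key and add a secondary sort key.

```java
job.setPartitionerClass(UserIdPartitioner.class);               // which reducer receives each key
job.setSortComparatorClass(UserIdThenTimeComparator.class);     // sort order within each partition
job.setGroupingComparatorClass(UserIdGroupingComparator.class); // which keys share one reduce() call
```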
58. Shuffle
59. Reduce Phase
   • The input to the reduce phase is each key from the shuffle plus all of the records associated with that key.
   • Because all records with the same value for the key are now collected together, it is possible to do joins and aggregation operations such as counting.
   • The MapReduce user explicitly controls parallelism in the reduce.
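The matching reduce side of the WordCount sketch: all values shuffled to a key arrive together, so counting is a simple sum. Reduce parallelism would be set explicitly in the driver via job.setNumReduceTasks(int).

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // aggregate every record that shares this key
        }
        context.write(key, new IntWritable(sum));
    }
}
```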
60. Reduce Phase
61. Output Phase
   • The reducer (or the map in a map-only job) writes its output via an OutputFormat.
   • The OutputFormat is responsible for providing a RecordWriter, which takes the key-value pairs produced by the task and stores them.
   • This includes serializing them, possibly compressing them, and writing them to HDFS, HBase, etc.
62. Map Reduce Logical Flow
63. Map Reduce Logical Flow
64. MapReduce Processing Model
65. Speculative Execution
   • If a Mapper runs slower than the others, a new instance of the Mapper will be started on another machine, operating on the same data.
   • The result of whichever Mapper finishes first will be used.
   • Hadoop will then kill off the Mapper that is still running.
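Speculative execution can be toggled per job; a small sketch assuming the Hadoop 2.x (MRv2) property names:

```java
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.speculative", true);     // allow backup map attempts
conf.setBoolean("mapreduce.reduce.speculative", false); // but no backup reduce attempts
Job job = Job.getInstance(conf, "word count");
```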
66. Distributed Cache
   • Sometimes all or many of the tasks in a MapReduce job will need to access a single file or a set of files.
   • When thousands of map or reduce tasks attempt to open the same HDFS file simultaneously, this puts a large strain on the NameNode and the DataNodes storing that file.
   • To avoid this situation, MapReduce provides the distributed cache.
   • The distributed cache allows users to specify, as part of their MapReduce job, any HDFS files they want every task to have access to.
   • These files are then copied onto the local disk of the task nodes as part of task initiation. Map or reduce tasks can then read them as local files.
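A minimal sketch of the cache in use, assuming a hypothetical HDFS stopword file: the driver registers the file, and each task then reads its local copy during setup().

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In the driver: job.addCacheFile(new java.net.URI("/lookup/stopwords.txt#stopwords"));
// The "#stopwords" fragment makes the cached copy appear under that local name.
public class StopwordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final java.util.Set<String> stopwords = new java.util.HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader("stopwords"))) {
            String line;
            while ((line = r.readLine()) != null) {
                stopwords.add(line.trim()); // loaded once per task, not per record
            }
        }
    }
    // map() would then drop tokens found in the cached stopword set.
}
```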
67. Setting up the Environment
   • Hortonworks Sandbox: http://hortonworks.com/products/sandbox-instructions/
   • VMware Player: http://www.vmware.com/products/player/overview.html
   • Setup guide: http://hortonworks.com/wp-content/uploads/2013/03/InstallingHortonworksSandboxonWindowsUsingVMwarePlayerv2.pdf
68. Hortonworks Sandbox
69. Hortonworks Sandbox
70. MapReduce Demo
   • Eclipse plugin:
     • HDFS operations
     • Running WordCount, TopK
     • Generating jars for the HDP Sandbox
   • Sandbox:
     • HDFS operations
     • Loading and running jar files
     • Oozie and Ambari
71. Hive Demo
   • Create table with HCatalog
   • Load data into Hive
   • Query data
   • Output to table/HDFS/local
   • JOIN
72. Pig Demo
   • Load data
   • Transform
   • Grouping
   • JOIN
73. Thank You