The Good, The Bad and the Ugly: How to Tame the Big Data Beast. Guy Loewenberg, May 2013
  1. The Good, The Bad and the Ugly: How to Tame the Big Data Beast. Guy Loewenberg, May 2013
  2. Overview
     • Data Explosion
  3. Overview
     • Big Data: A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications
     • Hadoop: A framework that allows distributed processing of large data sets across clusters of computers using a simple programming model
     • 1000 Kilobytes = 1 Megabyte
     • 1000 Megabytes = 1 Gigabyte
     • 1000 Gigabytes = 1 Terabyte
     • 1000 Terabytes = 1 Petabyte
     • 1000 Petabytes = 1 Exabyte
     • 1000 Exabytes = 1 Zettabyte
     • 1000 Zettabytes = 1 Yottabyte
     • 1000 Yottabytes = 1 Brontobyte
     • 1000 Brontobytes = 1 Geopbyte
     Scale annotations on the slide (their positions are not recoverable from the transcript): Most US SME corporations; Most US large corporations; Leaders like Facebook & Google
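The unit ladder on this slide can be checked with a small helper. This is a minimal sketch, assuming decimal (power-of-1000) steps exactly as the slide lists them; `UNITS` and `convert` are hypothetical names introduced here for illustration.

```python
# Decimal (power-of-1000) units, in the order the slide lists them.
# "Brontobyte" and "Geopbyte" are informal names, not SI prefixes.
UNITS = ["Kilobyte", "Megabyte", "Gigabyte", "Terabyte", "Petabyte",
         "Exabyte", "Zettabyte", "Yottabyte", "Brontobyte", "Geopbyte"]


def convert(value, from_unit, to_unit):
    """Convert between two units in UNITS, using a factor of 1000 per step."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    if steps >= 0:
        return value * 1000 ** steps      # moving to a smaller unit
    return value / 1000 ** (-steps)       # moving to a larger unit


print(convert(1, "Petabyte", "Gigabyte"))
print(convert(2500, "Gigabyte", "Terabyte"))
```

Going down the ladder multiplies by 1000 per step and going up divides, so one Petabyte is a million Gigabytes.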
  4. Hadoop Basics
     • Designed to scale
     • Uses commodity hardware
     • Processes data in batches
     • Can process very large volumes of data (petabytes)
  5. Core Hadoop
     • Core Hadoop is built from two main systems:
       – The Hadoop Distributed File System (HDFS)
       – The MapReduce programming framework
  6. Hadoop architecture
     • Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.
       – The NameNode controls HDFS, whereas the DataNodes perform the block replication and read/write operations and drive the workloads for HDFS
       – They work in a master/slave mode.
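The division of labor between the NameNode and the DataNodes can be pictured with a toy model. This is only a sketch under stated assumptions: `place_blocks` is a hypothetical helper, not a Hadoop API, and it spreads replicas round-robin, whereas real HDFS placement is rack-aware.

```python
import itertools


def place_blocks(file_size_mb, block_size_mb, datanodes, replication):
    """Return {block_index: [datanode, ...]}, the kind of map a NameNode keeps.

    Toy model: replicas are assigned round-robin across the DataNodes.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    ring = itertools.cycle(datanodes)
    return {block: [next(ring) for _ in range(replication)]
            for block in range(num_blocks)}


# A 200 MB file, 64 MB blocks, four DataNodes, three replicas per block.
layout = place_blocks(200, 64, ["dn1", "dn2", "dn3", "dn4"], 3)
for block, replicas in layout.items():
    print(block, replicas)
```

The master only tracks which nodes hold which block; the data itself lives on the DataNodes, which is why losing one node leaves every block with surviving replicas to copy from.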
  7. Hadoop architecture
     • MapReduce: Distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction.
       – The JobTracker schedules jobs and allocates tasks to TaskTracker nodes, which execute the requested map and reduce processes
       – They work in a master/slave mode
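The map and reduce processes mentioned above follow a simple pattern that can be shown without a cluster. The sketch below runs the three phases (map, shuffle, reduce) in a single Python process on a word count; it illustrates the programming model only and does not use the Hadoop API. All function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


counts = word_count(["big data beast", "tame the big data"])
print(counts)
```

On a cluster, each map call would run on the node that holds the input block and each reduce call would run on whichever node is assigned that key; the functions themselves stay this small.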
  8. Hadoop software architecture
     • MapReduce: Parallel data processing framework for large data sets
     • HDFS: Hadoop Distributed File System
     • Oozie: MapReduce job scheduler
     • HBase: Key-value database
     • Pig: Large data sets analysis language
     • Hive: High-level language for analyzing large data sets
     • ZooKeeper: Distributed coordination system
     • Solr / Lucene: Search engine, query engine library
  9. What Hadoop can’t do
     • Hadoop lets you perform batch analysis on whatever data you have stored within Hadoop. That data does not have to be structured.
       – Many solutions take advantage of Hadoop's low storage cost to store structured data there instead of in an RDBMS, but shifting data back and forth between Hadoop and an RDBMS would be overkill.
       – Transactional data is highly complex: a single transaction on an e-commerce site can generate many steps that all have to be executed quickly. That scenario is not ideal for Hadoop.
       – Structured data sets that require very low latency are also a poor fit.
  10. Comparing RDBMS to MapReduce

      Aspect      RDBMS                   MapReduce
      Data size   Gigabytes               Petabytes
      Access      Interactive and batch   Batch
      Structure   Fixed schema            Unstructured schema
      Language    SQL                     Procedural (Java, C++, Ruby, etc.)
      Integrity   High                    Low
      Scaling     Nonlinear               Linear
      Updates     Read and write          Write once, read many times
      Latency     Low                     High
  11. What Hadoop can do
      • High data volumes, stored in Hadoop and queried at length later using MapReduce functions:
        – index building
        – pattern recognition
        – creating recommendation engines
        – sentiment analysis
      • Hadoop should be integrated within your existing IT infrastructure in order to capitalize on the countless pieces of data that flow into your organization.
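Index building, the first workload listed above, maps naturally onto the same model: the map step emits (word, document) pairs and the reduce step collects the documents for each word. The following is a minimal single-process sketch of that idea, not Hadoop code; `map_doc`, `build_index`, and the sample documents are made up for illustration.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: emit a (word, doc_id) pair for every word in one document."""
    return [(word.lower(), doc_id) for word in text.split()]


def build_index(docs):
    """Group pairs by word, then reduce each group to a sorted list of ids."""
    grouped = defaultdict(set)
    for doc_id, text in docs.items():
        for word, doc in map_doc(doc_id, text):
            grouped[word].add(doc)
    return {word: sorted(ids) for word, ids in grouped.items()}


index = build_index({
    "a": "Hadoop stores data",
    "b": "Hadoop processes data in batches",
})
print(index["hadoop"])
print(index["batches"])
```

Because each document is mapped independently, the work splits cleanly across nodes, which is what makes this kind of job a good fit for batch processing.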
  12. Hadoop Maturity?!
      • Inaccessible to analysts without programming ability
      • Clusters keep no record of who changed which record and when it was changed
      • Storage functionality that organizations have always depended on (snapshots, mirroring) is lacking in HDFS
      • Incompatibility with existing tools
      • Data without structure has limited value, and applying the structure at query time requires a lot of Java code
      • Limited documentation
      • Limited troubleshooting capabilities
  13. Choosing your infrastructure
      • Define what you want to achieve
        – POC
        – Scale (few, tens, hundreds)
        – One-time, periodic, continuous
      • Infrastructure design
        – Servers, storage, network, rack space
        – Define a joint team of Hadoop application/development and infrastructure specialists (facilities, server, network) when building a solution
        – Virtual machines vs. physical machines (IO performance, high CPU, network)
  14. Choosing your infrastructure
      • Network infrastructure
        – Data movement between nodes (rack awareness, replication factor)
        – Data between sites (hosting/service)
      • Storage (architecture, disks)
        – Local disks, JBOD
        – Increase the default block size
      • Operations
        – Monitor
        – Backup (configuration files, journal, checkpoint …)
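The effect of the replication factor and block size on capacity planning can be estimated with simple arithmetic. This sketch assumes a 64 MB default block size and a replication factor of 3, which were the usual defaults in Hadoop 1.x at the time of the talk; `hdfs_footprint` is a hypothetical helper, and the 1 TB file is an invented example.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Estimate block count and raw disk usage for one file.

    Assumed defaults: 64 MB blocks, 3 replicas. The last block only
    occupies its actual data size, so raw storage is size * replication.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "raw_storage_mb": file_size_mb * replication,
    }


one_tb = 1_000_000  # 1 Terabyte in Megabytes, decimal units
print(hdfs_footprint(one_tb))                      # default block size
print(hdfs_footprint(one_tb, block_size_mb=256))   # larger blocks
```

Raising the block size leaves raw storage unchanged but cuts the number of blocks, which means less metadata for the NameNode to hold and fewer map tasks to schedule for the same file.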
  15. Performance & Scale considerations
      • Consider running each of the following on a dedicated, standalone server that is not shared with other Hadoop processes:
        – NameNode, Secondary NameNode and/or Checkpoint Node
        – JobTracker and the HBase (or any DB) Master
      • Consider a dedicated physical environment
  16. Thank you! Hadoop - The Good, The Bad and the Ugly. Guy Loewenberg
  17. SUPPORTING SLIDES
  18. HDFS Architecture
  19. Improving RDBMS with Hadoop
      • Accelerating nightly batch business processes
      • Storage of extremely high volumes of enterprise data
      • Creation of automatic redundant backups
      • Improving the scalability of applications
      • Use of Java for data processing instead of SQL
      • Producing just-in-time feeds for dashboards and business intelligence
      • Handling urgent, ad hoc requests for data
      • Turning unstructured data into relational data
      • Taking on tasks that require massive parallelism
      • Moving existing algorithms, code, frameworks, and components to a highly distributed computing environment
