Practical Introduction to Hadoop

In this slidecast, Alex Gorbachev from Pythian presents a Practical Introduction to Hadoop. This is a great primer for viewers who want to get the big picture on how Hadoop works with Big Data and how this approach differs from relational databases.

Watch the presentation: http://inside-bigdata.com/slidecast-a-practical-introduction-to-hadoop/
Download the audio:

Published in: Technology

  1. Practical Introduction to Hadoop, Alex Gorbachev, 30-May-2013
  2. Who is Pythian? (© 2012 – Pythian)
     • 15 years of data infrastructure management consulting
     • 170+ top brands
     • 6000+ databases under management
     • Over 200 DBAs in 26 countries
     • Top 5% of DBA work force, 9 Oracle ACEs, 2 Microsoft MVPs
     • Oracle, Microsoft, MySQL, Netezza, Hadoop, MongoDB, Oracle Apps, Enterprise Infrastructure
  3. When to Engage Pythian? (diagram labels)
     • Tier 3 data: local retailer
     • Tier 2 data: eCommerce
     • Tier 1 data: health care
     • Profit: strategic upside value from data
     • Loss: impact of an incident, whether it be data loss, security, human error, etc.
     • Value of data
     • LOVE YOUR DATA
  4. Alex Gorbachev
     • Chief Technology Officer at Pythian
     • Blogger
     • Cloudera Champion of Big Data
     • OakTable Network member
     • Oracle ACE Director
     • Founder of BattleAgainstAnyGuess.com
     • Founder of Sydney Oracle Meetup
     • IOUG Director of Communities
     • EVP, Ottawa Oracle User Group
  5. Given enough skill and money, relational databases can do anything. Sometimes it’s just unfeasibly expensive.
  6. What is Hadoop?
  7. Hadoop Principles
     • Bring code to data
     • Share nothing
  8. Hadoop in a Nutshell
     • Replicated distributed big-data file system
     • Map-Reduce: framework for writing massively parallel jobs
  9. HDFS architecture (simplified view)
     • Files are split into large blocks
     • Each block is replicated on write
     • A file can only be created and deleted by one client
       • Uploading new data? => new file
       • Append supported in recent versions
       • Update data? => recreate file
       • No concurrent writes to a file
     • Clients transfer blocks directly to & from data nodes
     • Data nodes use cheap local disks
     • Local reads are the most efficient
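The block splitting and replication described on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions: the function names, the tiny block size, and the round-robin placement are hypothetical and chosen for readability. Real HDFS uses much larger blocks and a rack-aware placement policy, neither of which is modelled here.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file's contents into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct data nodes.

    Assumption: simple round-robin placement, used here only for illustration.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many data nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Because every block lives on several nodes, losing one data node leaves at least two other copies of each of its blocks, which is what lets the file system stay reliable on cheap local disks.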
  10. HDFS design principles
  11. Map Reduce example: histogram calculation
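The slide's diagram did not survive in this transcript, so the following is a stand-in rather than the presenter's example: a minimal single-process sketch of a histogram computed in the map, shuffle, and reduce style. The function names and the bucket width are assumptions, and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(record: int, bucket_width: int):
    """Mapper: emit (bucket, 1) for each input value."""
    bucket = (record // bucket_width) * bucket_width
    yield bucket, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: int, values: list[int]):
    """Reducer: sum the counts for one bucket."""
    return key, sum(values)


def histogram(records, bucket_width: int = 10) -> dict[int, int]:
    mapped = (pair for r in records for pair in map_phase(r, bucket_width))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, vs) for k, vs in sorted(grouped.items()))


result = histogram([1, 5, 12, 18, 25, 31, 39], bucket_width=10)
# result == {0: 2, 10: 2, 20: 1, 30: 2}
```

On a cluster, many mappers would run this map step in parallel next to their local blocks, and the shuffle would move the intermediate pairs across the network to the reducers.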
  12. Map Reduce pros & cons
     Advantages
     • Simple programming paradigm
     • Flexible
     • Highly scalable
     • Good fit for HDFS: mappers read locally
     • Fault tolerant
       • Task failure or node failure doesn’t affect the whole job; they are restartable
     Pitfalls
     • Low efficiency
     • Lots of intermediate data
     • Lots of network traffic on shuffle
     • Complex manipulation requires pipeline of multiple jobs
     • No high-level language
     • Only mappers leverage local reads on HDFS
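The "restartable" point can be sketched as a retry loop around independent tasks. This is a hypothetical illustration, not how the Hadoop scheduler is implemented: `run_with_retries`, `run_task`, and `max_attempts` are made-up names, and a `RuntimeError` stands in for a task or node failure.

```python
def run_with_retries(tasks: dict, run_task, max_attempts: int = 3):
    """Run each task independently; re-run only the task that failed.

    Returns (results, attempts) where attempts records how many tries each task needed.
    """
    results = {}
    attempts = {}
    for task_id, payload in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = run_task(task_id, payload)
                break  # this task is done; other tasks are unaffected
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up on the whole job only after repeated failures
    return results, attempts
```

The design choice this illustrates is that tasks share nothing: a failed task is simply run again on its own input, so the work already completed by other tasks is kept.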
  13. Some components of the Hadoop ecosystem
     • Hive: HiveQL is an SQL-like query language
       • Generates MapReduce jobs
     • Pig: data set manipulation language (like creating your own query execution plan)
       • Generates MapReduce jobs
     • Mahout: machine learning libraries
       • Generates MapReduce jobs
     • Oozie: workflow scheduler services
     • Sqoop: transfers data between Hadoop and relational databases
  14. MapReduce is SLOW! Speed through massive parallelization
  15. Non-MR processing on Hadoop
     • HBase: column-oriented key-value store (NoSQL)
     • SQL without Map Reduce
       • Impala (Cloudera)
       • Drill (MapR)
       • Phoenix (Salesforce.com)
       • Hadapt (commercial)
     • Shark: Spark in-memory analytics on Hadoop
     • Platfora (commercial)
       • In-memory analytics + visualization & reporting tool
  16. Hadoop Benefits
     • Reliable solution based on unreliable hardware
     • Designed for large files
     • Load data first, structure later
     • Designed to maximize throughput of large scans
     • Designed to leverage parallelism
     • Designed to scale
     • Flexible development platform
     • Solution ecosystem
  17. Hadoop Limitations
     • Hadoop is scalable but not fast
     • Some assembly required
     • Batteries not included
     • Instrumentation not included either
     • DIY mindset (remember MySQL?)
     • Commercial distributions are not free
     • Simplistic security models
  18. How much does it cost? $300K DIY on SuperMicro
     • 100 data nodes
     • 2 name nodes
     • 3 racks
     • 800 Sandy Bridge CPU cores
     • 6.4 TB RAM
     • 600 x 2TB disks
     • 1.2 PB of raw disk capacity
     • 400 TB usable (triple mirror)
     • Open-source s/w, maybe commercial distribution
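The capacity figures on this slide follow from simple arithmetic, which the sketch below reproduces. The per-node breakdown and the cost per usable terabyte are derived here and do not appear on the slide; the RAM figure assumes 1 TB = 1000 GB.

```python
def cluster_summary() -> dict:
    """Recompute the slide's capacity numbers from its raw inputs."""
    data_nodes = 100
    cpu_cores = 800
    ram_gb = 6400          # 6.4 TB RAM, assuming 1 TB = 1000 GB
    disks = 600
    disk_tb = 2
    replication = 3        # "triple mirror"
    cost_usd = 300_000

    raw_tb = disks * disk_tb              # 1200 TB = 1.2 PB raw
    usable_tb = raw_tb // replication     # 400 TB usable
    return {
        "raw_tb": raw_tb,
        "usable_tb": usable_tb,
        "cores_per_node": cpu_cores // data_nodes,
        "ram_gb_per_node": ram_gb // data_nodes,
        "disks_per_node": disks // data_nodes,
        "usd_per_usable_tb": cost_usd // usable_tb,   # derived, not from the slide
    }


summary = cluster_summary()
```

Read this way, each data node is a modest machine (8 cores, 64 GB RAM, 6 disks), and the headline price works out to roughly $750 per usable terabyte before any commercial distribution fees.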
  19. Hadoop vs. Relational Database
     • Cool ⇔ Old
     • Load first, structure later ⇔ Structure first, load later
     • “Cheap” hardware ⇔ Enterprise grade hardware
     • DIY ⇔ Repeatable solutions
     • Flexible data store ⇔ Efficient data store
     • Effectiveness via scale ⇔ Effectiveness via efficiency
     • Petabytes ⇔ Terabytes
     • 100s to 1000s of cluster nodes ⇔ Dozens of nodes (maybe)
  20. Big Data and Hadoop Use Cases
  21. Use Cases for Big Data
     • Top-line contributors
       • Analyze customer behavior
       • Optimize ad placements
       • Customized promotions, etc.
       • Recommendation systems
         • Netflix, Pandora, Amazon
       • New products and services
         • Prismatic, smart home
  24. Use Cases for Big Data
     • Bottom-line contributors
       • Cheap archive storage
       • ETL layer: transformation engine, data cleansing
  25. Typical Initial Use-Cases for Hadoop in modern Enterprise IT
     • Transformation engine (part of ETL)
       • Scales easily
       • Inexpensive processing capacity
       • Any data source and destination
     • Data Landfill
       • Stop throwing away any data
       • Don’t know how to use data today? Maybe tomorrow you will
       • Hadoop is very inexpensive but very reliable
  26. Advanced: Data Science Platform
     • Data warehouse is good when questions are known and the data domain and structure are defined
     • Hadoop is great for seeking new meaning of data, new types of insights
       • Unique information parsing and interpretation
       • Huge variety of data sources and domains
     • When new insights are found and new structure defined, Hadoop often takes the place of the ETL engine
     • Newly structured information is then loaded to more traditional data warehouses (still today)
  27. Pythian Internal Hadoop Use
     • OCR of screen video capture from Pythian privileged access surveillance system
       • Input raw frames from video capture
       • Map-Reduce job runs OCR on frames and produces text
       • Map-Reduce job identifies text changes from frame to frame and produces text stream with timestamp when it was on the screen
     • Other Map-Reduce jobs mine text (and keystrokes) for insights
       • Credit card patterns
       • Sensitive commands (like DROP TABLE)
       • Root access
       • Unusual activity patterns
     • Merge with monitoring and documentation systems
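Two steps of this pipeline, detecting text changes between frames and flagging sensitive commands, can be sketched in a few lines. This is a hypothetical single-process illustration, not Pythian's Map-Reduce implementation: the function names, the `(timestamp, text)` input shape, and the single `DROP TABLE` pattern are all assumptions, and the OCR step is taken as already done.

```python
import re


def text_changes(frames):
    """Yield (timestamp, text) only when the on-screen text differs from the previous frame.

    `frames` is an iterable of (timestamp, text) pairs, assumed to be in time order.
    """
    previous = None
    for timestamp, text in frames:
        if text != previous:
            yield timestamp, text
            previous = text


# Assumption: one illustrative pattern; the slide lists several kinds of insight.
SENSITIVE = re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE)


def flag_sensitive(stream):
    """Return the entries of a text stream that contain a sensitive command."""
    return [(ts, text) for ts, text in stream if SENSITIVE.search(text)]


frames = [(0, "ls"), (1, "ls"), (2, "DROP TABLE users"), (3, "DROP TABLE users"), (4, "exit")]
stream = list(text_changes(frames))
alerts = flag_sensitive(stream)
```

Collapsing identical consecutive frames first keeps the later mining jobs from scanning the same screen contents many times over.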
  28. Thank you & Q&A
     To follow us…
     • http://www.pythian.com/blog/
     • http://www.facebook.com/pages/The-Pythian-Group/
     • http://twitter.com/pythian
     • http://twitter.com/alexgorbachev
     • http://www.linkedin.com/company/pythian
     To contact us…
     • 1-866-PYTHIAN
     • sales@pythian.com
     • gorbachev@pythian.com
