Dallas TDWI Meeting Dec. 2012: Hadoop

  6. Go check out: Data Processing with Hadoop: Scalable and Cost-Effective, Doug Cutting, Apache Hadoop Co-founder, April 26th, 2011. This is the keynote presentation from the Chicago Data Summit. Doug Cutting takes us through the creation of Apache Hadoop, Hadoop's adoption, and the key advantages of Hadoop, and answers several questions from attendees.
     http://www.cloudera.com/videos/chicago_data_summit_keynote_data_processing_with_hadoop_scalable_and_cost_effective_doug_cutting_apache_hadoop_co-founder_hadoop
  7. http://hadoop.apache.org/
     The project includes these subprojects:
     •  Hadoop Common: The common utilities that support the other Hadoop subprojects.
     •  Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
     •  Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters (a minimal example follows this list).
     Other Hadoop-related projects at Apache include:
     •  Avro™: A data serialization system.
     •  Cassandra™: A scalable multi-master database with no single points of failure.
     •  Chukwa™: A data collection system for managing large distributed systems.
     •  HBase™: A scalable, distributed database that supports structured data storage for large tables.
     •  Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
     •  Mahout™: A scalable machine learning and data mining library.
     •  Pig™: A high-level data-flow language and execution framework for parallel computation.
     •  ZooKeeper™: A high-performance coordination service for distributed applications.
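     To make the MapReduce subproject concrete, here is a minimal word-count sketch (not from the slides) written against the standard org.apache.hadoop.mapreduce Java API; the class name and the input/output paths passed on the command line are illustrative. The mapper emits a (word, 1) pair for every token, and the reducer sums the pairs for each word.
     ------------------------------------------------------------------------------
     import java.io.IOException;
     import java.util.StringTokenizer;

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.Mapper;
     import org.apache.hadoop.mapreduce.Reducer;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

     public class WordCount {

       // Map phase: runs on each input split in parallel across the cluster,
       // emitting (word, 1) for every token in the line.
       public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
         private static final IntWritable ONE = new IntWritable(1);
         private final Text word = new Text();

         @Override
         public void map(Object key, Text value, Context context)
             throws IOException, InterruptedException {
           StringTokenizer itr = new StringTokenizer(value.toString());
           while (itr.hasMoreTokens()) {
             word.set(itr.nextToken());
             context.write(word, ONE);
           }
         }
       }

       // Reduce phase: receives all counts emitted for one word and sums them.
       public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
         private final IntWritable result = new IntWritable();

         @Override
         public void reduce(Text key, Iterable<IntWritable> values, Context context)
             throws IOException, InterruptedException {
           int sum = 0;
           for (IntWritable val : values) {
             sum += val.get();
           }
           result.set(sum);
           context.write(key, result);
         }
       }

       public static void main(String[] args) throws Exception {
         Job job = Job.getInstance(new Configuration(), "word count");
         job.setJarByClass(WordCount.class);
         job.setMapperClass(TokenizerMapper.class);
         job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper node
         job.setReducerClass(IntSumReducer.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
         FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
         System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
     }
     ------------------------------------------------------------------------------
     Packaged as a jar, a job like this is typically submitted from a client machine with something along the lines of "hadoop jar wordcount.jar WordCount <input dir> <output dir>" (names hypothetical), which connects to slide 9's point that users submit computing jobs to the cluster from individual clients.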
  8. Reference: http://en.wikipedia.org/wiki/Apache_Hadoop
  9. Reference: Hadoop in Action, Chuck Lam, Manning Publications, 2011.
     A Hadoop cluster is a set of commodity machines networked together in one location. While not strictly necessary, machines in a Hadoop cluster are usually relatively homogeneous x86 Linux boxes, and they're almost always located in the same data center, often in the same rack. Data storage and processing all occur within this “cloud” of machines. Different users can submit computing “jobs” to Hadoop from individual clients.
  12. Reference: InformationWeek, Charles Babcock, 06/22/2010. Designed for cloud computing, the Hadoop data management system handles petabytes of data at a time, pairing Google's MapReduce with a distributed file management system for use on large clusters.
      Image gallery: Yahoo's Hadoop Implementation:
      http://www.informationweek.com/news/galleries/software/info_management/225700411?pgno=1
  30. http://www.informationweek.com/news/galleries/software/info_management/225700411?pgno=8
      Pig Parallel Programming Language
      Olga Natkovich, Pig engineering manager, and Alan Gates, Pig lead architect and a Pig contributor. Pig is a parallel programming language developed by Yahoo Research, the firm's central research unit, which allows Yahoo to easily perform procedural data processing tasks on top of Hadoop. It is the standard pipeline processing solution at Yahoo!
      SQL example:
      ------------------------------------------------------------------------------
      SELECT user, COUNT(*) FROM excite-small.log GROUP BY user;
      ------------------------------------------------------------------------------
      In Pig this becomes:
      ------------------------------------------------------------------------------
      log = LOAD 'excite-small.log' AS (user, time, query);
      grpd = GROUP log BY user;
      ------------------------------------------------------------------------------
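      The Pig script is cut off by the slide footer. To match the SQL query above (one row count per user), this standard example would typically continue along these lines; the alias names are an assumption:
      ------------------------------------------------------------------------------
      cntd = FOREACH grpd GENERATE group, COUNT(log); -- one (user, count) row per group, like COUNT(*)
      DUMP cntd;                                      -- write the results to the console
      ------------------------------------------------------------------------------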
  31. Apache Hive page: http://hive.apache.org/
      Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
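      As a rough sketch of what HiveQL looks like (not from the slides), the per-user count from the Pig slide could be expressed as follows; the table name, column names, and file path are hypothetical:
      ------------------------------------------------------------------------------
      -- Hypothetical table projected over tab-delimited log data in HDFS.
      CREATE TABLE weblog (user_id STRING, event_time STRING, query STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

      -- Load a (hypothetical) log file into the table.
      LOAD DATA INPATH '/data/excite-small.log' INTO TABLE weblog;

      -- Same aggregation as the earlier SQL/Pig example: one count per user.
      SELECT user_id, COUNT(*) FROM weblog GROUP BY user_id;
      ------------------------------------------------------------------------------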
  32. Apache HBase page: http://hbase.apache.org/
  34. http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
      Sanjay Sharma's Weblog, August 16, 2010: "Hadoop Ecosystem World-Map"
      While preparing for the keynote for the recently held HUG India meetup on 31st July, I decided that I would try to keep my session short, but useful and relevant to the lined-up sessions on hiho, JAQL and Visual hive. I have always been a keen student of geography (still take pride in it!) and thought it would be great to draw a visual geographical map of the Hadoop ecosystem. Here is what I came up with, and a little nice story behind it:
      1. How did it all start? Huge data on the web!
      2. Nutch built to crawl this web data
      3. Huge data had to be saved: HDFS was born!
      4. How to use this data?
      5. MapReduce framework built for coding and running analytics: Java, any language (streaming/pipes)
      6. How to get in unstructured data (web logs, click streams, Apache logs, server logs): fuse, webdav, Chukwa, Flume, Scribe
      7. Hiho and Sqoop for loading data into HDFS: RDBMSs can join the Hadoop bandwagon!
      8. High-level interfaces required over low-level map reduce programming: Pig, Hive, Jaql
      9. BI tools with advanced UI reporting (drilldown etc.): Intellicus
      10. Workflow tools over Map-Reduce processes and high-level languages
      11. Monitor and manage Hadoop, run jobs/Hive, view HDFS (high-level view): Hue, Karmasphere, Eclipse plugin, Cacti, Ganglia
      12. Support frameworks: Avro (serialization), ZooKeeper (coordination)
      13. More high-level interfaces/uses: Mahout, Elastic MapReduce
      14. OLTP also possible: HBase
