Managing a Zoo: Tools for Managing and Monitoring Distributed Systems from Cloudera. Alexander Kozlov, Cloudera Inc.

Alexander Kozlov, Cloudera Inc.

Alexander Kozlov, a senior architect at Cloudera Inc., works with large companies, many of them in the Fortune 500, on projects to build systems for analyzing large volumes of data. He completed graduate studies at the Physics Department of Moscow State University and then also earned a Ph.D. at Stanford. After his studies and before joining Cloudera, he worked on statistical data analysis and related computing technologies at SGI, Hewlett-Packard, and the startup Turn.

Talk topic
Managing a Zoo: Tools for Managing and Monitoring Distributed Systems from Cloudera.

Abstract
Maintaining distributed systems made up of thousands of computers is a hard problem. Cloudera, which specializes in building distributed technologies, has developed a set of tools for centralized management of distributed Hadoop/HBase clusters. Hadoop and HBase are Apache Software Foundation projects, and their adoption for analyzing semi-structured data is accelerating worldwide. This talk covers SCM, a system for configuring, tuning, and managing Hadoop/HBase, and Activity Monitor, a system for monitoring a range of OS and Hadoop/HBase metrics, as well as how Cloudera's approach differs from existing monitoring solutions (Tivoli, xCat, Ganglia, Nagios, etc.).

Transcript

  • 1. Managing a Zoo: Tools for Managing and Monitoring Distributed Systems from Cloudera. Alex Kozlov, Cloudera Inc. YaC, Moscow, September 19, 2011
  • 2. Agenda
    • About Cloudera and myself
    • Background info
      – Data, data everywhere
      – Corporate data management, distributed systems, functional languages
      – Hadoop ecosystem
    • Distributed system maintenance
      – Installation/Updating/Monitoring
        • Fixed images
        • Standard configuration management tools
      – Our solution
        • Partial failures
        • Node cast
        • …
    • Implementation
    • What’s next
    ©2011 Cloudera, Inc. All Rights Reserved.
  • 3. About Cloudera
    • Founded in the summer of 2008
    • Cloudera’s mission is to help organizations profit from all of their data.
    • We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale.
    • We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before.
    • With Cloudera, you get better insight into your customers, partners, vendors and businesses.
  • 4. Introduction
    • # whoami
      – alexvk
      – Graduated from the Physics Department of MSU and from Stanford University
      – Worked at SGI, HP, Turn
      – Senior Solutions Architect, Cloudera, Inc.
    • # whoru
      – Sysadmin
      – IT Manager
      – TechOps
      – Data Scientist
      – Researcher
      – Developer
      – CTO? …
      – Just curious?
  • 5. Data, data everywhere
    • We are storing a lot more data:
      – 1 click on an average web site generates about 100 lines of logs (somewhere)
      – 1 additional attribute/integer (8 bytes) means 1 TB/day of data (from an ex-Google employee)
    • 40-80 PB stores are becoming common
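
The second estimate on slide 5 can be sanity-checked with simple arithmetic: if one extra 8-byte attribute adds 1 TB/day, how many records per day does that imply? The sketch below assumes decimal terabytes (1 TB = 10^12 bytes), which the slide does not specify, and the function name is purely illustrative.

```python
# Assumption: decimal terabytes (1 TB = 10**12 bytes); the slide does not say
# which unit is meant.
BYTES_PER_TB = 10**12
ATTRIBUTE_SIZE_BYTES = 8  # one extra integer attribute, as stated on the slide


def records_per_day(extra_tb_per_day, attribute_bytes=ATTRIBUTE_SIZE_BYTES):
    """Records per day implied by a given daily volume of one extra attribute."""
    return extra_tb_per_day * BYTES_PER_TB / attribute_bytes


if __name__ == "__main__":
    n = records_per_day(1.0)
    print(f"{n:.3e} records/day")        # 1.250e+11
    print(f"{n / 86_400:.3e} records/s")
```

Under that assumption the claim corresponds to roughly 125 billion records per day.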
  • 6. Corporate data management
    • Traditional
      – EDW, centralized (SPoF)
      – Fixed set of queries (sales/revenue by quarter, etc.)
      – ETL pipeline taking up to 24 hours to run
    • Future
      – Data from
        • Any source
        • Any kind
        • At scale
      – Flexible insights
        • The value is not known beforehand
        • Multiple facets of deep, exhaustive analysis
      – Interactive
        • 5 min delay from click to insights
  • 7. The Origins of Hadoop
    [Timeline figure; axis labels: 2002, 2004, 2008, 2010, 2011, 2012. Events shown, in order of appearance:]
    • Open source web crawler project created by Doug Cutting
    • Publishes MapReduce and GFS paper
    • Open source MapReduce and HDFS project created by Doug Cutting
    • Runs 4,000-node Hadoop cluster
    • Hadoop tops Terabyte sort benchmark
    • Created Hive, adding SQL support
    • Releases CDH3 and Cloudera Enterprise
    • Releases Cloudera Enterprise 3.5 & SCM Express
  • 8. What is Hadoop?
    • HDFS + MapReduce = Hadoop
    • Hadoop is an ecosystem (HBase + friends)
    • Distributed storage
    • Moving computations to the data
    • A new model for fault tolerance
  • 9. CDH Overview
    The #1 commercial and non-commercial Apache Hadoop distribution.
    • File System Mount: FUSE-DFS
    • UI Framework: HUE
    • SDK: HUE SDK
    • Workflow: APACHE OOZIE
    • Scheduling: APACHE OOZIE
    • Metadata: APACHE HIVE
    • Languages / Compilers: APACHE PIG, APACHE HIVE
    • Data Integration: APACHE FLUME, APACHE SQOOP
    • Fast Read/Write Access: APACHE HBASE
    • Coordination: APACHE ZOOKEEPER
  • 10. Landscape
    • In 2008 (Cloudera founded)
      – 3-5 companies, mostly in social networking space, using Hadoop in production
      – A lot of interest, but mostly for the wrong reason
      – Biggest applications just smart log processing
      – Largest installation in 10s of PB
    • In 2011
      – 100s of paying clients
      – 2-3x growth in Hadoop conference attendance year-over-year
      – HBase, Oozie, Mahout
      – Lots of research (Spark, Mesos, Low latency DFS/MR, Graph algorithms)
      – Largest installations in 100s of PB
  • 11. Problem
    • Handling large data in distributed systems is uniquely challenging from an operational perspective.
    • Traditional approaches are valuable, but insufficient. Domain knowledge is vital.
    • Need for support within the frameworks themselves for operational concerns.
  • 12. Datacenter(s) as a computer
    • Existing tools do not generalize well
      – Partial failure (how many machines might fail before the datacenter becomes non-operational? … about 50%)
      – Hadoop-like metrics (data locality, # of slots, heartbeat delays)
      – Installation and lifecycle management
      – Heterogeneous nodes
    • The ultimate user wants to USE the system, not CONFIGURE it
      – let insight = [ for i in my_smart_algos -> data |> i ]
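
The one-liner on slide 12 reads like an F#-style list comprehension: apply every algorithm to the same data and collect the results. A rough Python analogue, with placeholder "algorithms" that are not from the talk, might look like this:

```python
def apply_all(algos, data):
    """Python analogue of: [ for i in my_smart_algos -> data |> i ]"""
    return [algo(data) for algo in algos]


if __name__ == "__main__":
    # Placeholder algorithms; the slide does not say what my_smart_algos contains.
    my_smart_algos = [len, min, max, sum]
    data = [3, 1, 4, 1, 5]
    print(apply_all(my_smart_algos, data))  # [5, 1, 5, 14]
```

The point of the slide is that this is all the end user should have to write; cluster configuration stays out of sight.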
  • 13. Why not machine images
    • Machines have complex state (config, local data)
      – Hard unless the state is trivial
      – Images need (rolling) upgrades
      – Machines can change (multiple) roles
  • 14. Why not config management tools
    • They assume M services running on N machines, not X services running in the “cloud”
      – Very bad with “partial failures”
      – Don’t understand Hadoop-specific “state”
      – Don’t understand Hadoop-specific metrics
  • 15. Our solution
    • Managing partial failure
      – Cluster is still usable if x% fail (but data loss is possible if 3 nodes fail at the same time)
      – “Running with concerning health”
    • Node cast
      – Every node can be multiple things (think zoo: it can be a tiger or a lion or a monkey)
    • Finding nodes like one or jobs like one
      – Nodes are grouped according to functionality (datanode, tasktracker, regionserver, namenode, jobtracker)
      – Find jobs that are similar to a given one and track outliers
    • Drill down for Hadoop-specific diagnostics
      – Workflow -> Jobs -> Tasks -> Attempts
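
The ideas on slide 15 (one node playing several roles, grouping nodes by role, and a health state between "good" and "bad") can be illustrated with a small sketch. This is not SCM code: the `Node` class, the function names and both thresholds are assumptions made for the example. The slides only say the cluster stays usable "if x% fail" and mention about 50% on slide 12.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical thresholds; the slides do not give concrete values.
CONCERNING_FRACTION = 0.0  # any failed node makes the health "concerning"
BAD_FRACTION = 0.5         # slide 12 suggests about 50% as the breaking point


@dataclass
class Node:
    host: str
    roles: set = field(default_factory=set)  # "node cast": one host, many roles
    alive: bool = True


def group_by_role(nodes):
    """Index nodes by role so that 'nodes like this one' are easy to find."""
    groups = defaultdict(list)
    for node in nodes:
        for role in node.roles:
            groups[role].append(node)
    return dict(groups)


def service_health(nodes):
    """Classify a role group as 'good', 'concerning' or 'bad' by failed fraction."""
    if not nodes:
        return "bad"
    failed = sum(1 for n in nodes if not n.alive) / len(nodes)
    if failed >= BAD_FRACTION:
        return "bad"
    if failed > CONCERNING_FRACTION:
        return "concerning"
    return "good"


if __name__ == "__main__":
    cluster = [
        Node("n1", {"datanode", "tasktracker"}),
        Node("n2", {"datanode", "regionserver"}, alive=False),
        Node("n3", {"namenode"}),
        Node("n4", {"datanode"}),
    ]
    for role, members in sorted(group_by_role(cluster).items()):
        print(role, service_health(members))
```

With one of three datanodes down, the datanode group is reported as "concerning" rather than failed, which matches the "running with concerning health" idea.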
  • 16. Details
    • Written in Python and Django
    • Each node runs an “SCM agent”
    • Dial-in mode
    • Agent makes a best effort to keep the prescribed service(s) running
    • All state managed by the “server”
    • Diagnostics are passed via heartbeats
    • Centralized configuration management
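
The agent/server split on slide 16 can be sketched as an in-memory simulation. It assumes that "dial-in mode" means the agent initiates each heartbeat, and that the server replies with the roles the host should run; the class and method names are hypothetical and do not reflect the real SCM API.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Holds all desired state; agents keep no configuration of their own."""
    desired: dict = field(default_factory=dict)   # host -> set of role names
    reported: dict = field(default_factory=dict)  # host -> last diagnostics

    def heartbeat(self, host, diagnostics):
        """Record the agent's diagnostics and reply with the roles it should run."""
        self.reported[host] = diagnostics
        return set(self.desired.get(host, set()))


@dataclass
class Agent:
    host: str
    running: set = field(default_factory=set)

    def tick(self, server):
        """One heartbeat: dial in, then make a best effort to match the reply."""
        wanted = server.heartbeat(self.host, {"running": sorted(self.running)})
        for role in wanted - self.running:
            self.start(role)
        for role in self.running - wanted:
            self.stop(role)

    def start(self, role):
        self.running.add(role)      # placeholder for launching a real process

    def stop(self, role):
        self.running.discard(role)  # placeholder for stopping a real process


if __name__ == "__main__":
    server = Server(desired={"n1": {"datanode", "tasktracker"}})
    agent = Agent("n1")
    agent.tick(server)
    print(sorted(agent.running))
    server.desired["n1"] = {"datanode"}  # configuration changes centrally
    agent.tick(server)
    print(sorted(agent.running))
```

Because the agent only converges toward whatever the server last prescribed, changing a host's roles is a server-side edit that takes effect on the next heartbeat.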
  • 17. Services
  • 18. Services (partial failure)
  • 19. New service
  • 20. SCM node selection
  • 21. Visualising drill-down
  • 22. Job matching
    • Requires that we build up a rich model of job performance over time
    • Surprisingly subtle problem: how do we know when two jobs are the same?
    • Periodic jobs offer more clues: time of day, submitting user, map class, reduce class.
    • Query jobs are more difficult
      – For Hive, for example, query string analysis can tell us something
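
One simple way to act on the clues listed on slide 22 is to build a similarity key from them and compare a job against its historical peers. The sketch below is an illustration under assumptions: the `Job` fields, the 6-hour time bucket and the 2x-median outlier rule are invented for the example and are not described in the talk.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str
    user: str
    map_class: str
    reduce_class: str
    submit_hour: int   # hour of day, 0-23
    duration_s: float


def signature(job, hour_bucket=6):
    """Hypothetical similarity key built from the clues listed on the slide."""
    return (job.user, job.map_class, job.reduce_class,
            job.submit_hour // hour_bucket)


def find_similar(target, history):
    """Jobs in the history that share the target's signature."""
    key = signature(target)
    return [j for j in history
            if j.job_id != target.job_id and signature(j) == key]


def is_outlier(target, history, factor=2.0):
    """Flag the job if it ran `factor` times longer than the median of its peers."""
    peers = find_similar(target, history)
    if not peers:
        return False
    median = statistics.median([j.duration_s for j in peers])
    return target.duration_s > factor * median


if __name__ == "__main__":
    history = [
        Job("j1", "etl", "ParseMap", "SumReduce", 2, 600.0),
        Job("j2", "etl", "ParseMap", "SumReduce", 3, 640.0),
        Job("j3", "bob", "GrepMap", "NoReduce", 14, 30.0),
    ]
    tonight = Job("j4", "etl", "ParseMap", "SumReduce", 2, 1900.0)
    print(is_outlier(tonight, history))
```

An exact-match key like this suits periodic jobs; ad hoc query jobs would need a fuzzier notion of similarity, as the slide notes.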
  • 23. Job matching
  • 24. Diagnosing job performance
    • Ok, your job really is slow. What now?
    • Major cause of slowness, as seen by customers, is skew
    • Two predominant types of skew
      – Environmental skew, when identical tasks run differently depending on where they run. Breaks MR notion of homogeneity, causes severe slowdown.
      – Workload skew, when supposedly identical tasks have vastly differing amounts of work to do.
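
The two kinds of skew on slide 24 can be told apart from per-task statistics. The sketch below is a deliberately crude heuristic, not the method used by Activity Monitor: the `Task` fields, the max-to-median ratio and the threshold of 2.0 are assumptions for illustration.

```python
import statistics
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    host: str
    input_bytes: int
    duration_s: float


def spread(values):
    """Ratio of max to median; 1.0 means perfectly even."""
    values = list(values)
    med = statistics.median(values)
    return max(values) / med if med > 0 else float("inf")


def diagnose(tasks, threshold=2.0):
    """Return the set of skew types suggested by the task statistics."""
    found = set()
    # Workload skew: some tasks have far more input than the typical task.
    if spread(t.input_bytes for t in tasks) >= threshold:
        found.add("workload")
    # Environmental skew: seconds per byte differ by host for comparable work.
    per_host = defaultdict(list)
    for t in tasks:
        per_host[t.host].append(t.duration_s / max(t.input_bytes, 1))
    host_cost = [statistics.median(costs) for costs in per_host.values()]
    if len(host_cost) > 1 and spread(host_cost) >= threshold:
        found.add("environmental")
    return found


if __name__ == "__main__":
    slow_host = [Task("a", 100, 10.0), Task("b", 100, 10.0), Task("c", 100, 50.0)]
    big_split = [Task("a", 100, 10.0), Task("b", 100, 10.0), Task("c", 1000, 100.0)]
    print(diagnose(slow_host))
    print(diagnose(big_split))
```

Normalizing duration by input size is what separates the two cases: a task that is slow only because it has more data keeps the same cost per byte, while a task on a bad host does not.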
  • 25. Visualising skew
  • 26. What’s next
    • Cloudera Enterprise 3.5 & Hadoop express (June 2011, SCM & SCM Express)
    • Cloudera Enterprise 3.7 on the way
  • 27. Questions?
    Do not hesitate to email me: alexvk at cloudera dot com