Capital One Hadoop Intro

Transcript

  • 1. Capital One Hadoop Intro (1/7/2014)
      − History
      − ETL/Analytics practices at LinkedIn/Netflix/Yahoo
      − Next Gen ETL 2014+
      − Scaling layers
      − Hadoop distributions
      − Analytics
  • 2. Hadoop/HBase
      − Original requirements:
          − GFS: store internet HTML pages on disk for later analytics
          − BigTable (2002): book pages had metadata. Requirement: return book pages to the user with no joins (memory was limited in 2002; different now)
      − Latency determines requirements (analytics/Netflix later). Semi-real time.
      − Schema for book pages. Where to store the metadata? In BigTable.
      − My role: I am not going to give you slides with pictures only; everything presented has code behind it, with documentation
  • 3. Big data: >>50% failure rate
      − After POCs, very few projects enter production
      − Why? Work habits for distributed computing. You have to write distributed computing components; J2EE idioms don't work.
      − Projects fail because of performance/administration in production
      − e.g. performance is not an issue when supporting the top 100 Ab Initio queries in Hadoop; 130k queries will be an issue, or perhaps even 10% of that
  • 4. Measuring performance in POCs: getting it wrong means they can't build components
      − [Diagram, labeled "Wrong": Server/Thread, NN, DN1/RS, DN2/RS]
  • 5. Performance measurement: leader election, countdown latch, test failure/handoff with Chaos Monkey
      − [Diagram: ZooKeeper+Jetty, Server, ZooKeeper, DN1/RS]
  • 6. Hive at LinkedIn (bottom left). All 3 similar
  • 7. LinkedIn Simple Abstractions
      − Teradata with Hadoop
      − Multiple clusters: Prod/Dev/Research (POC?)
      − Hive: ad hoc small ETL (lower left-hand corner)
      − Pig/DataFu + enhancements for production ETL
      − Multiple data stages in the green box (POC: Ab Initio data staging, REST API for staging)
      − Workflow POC: Oozie+Pig+Hive. Add a Web UI
      − Data Staging POC: CDK as example
  • 8. POC Coding Style
      − High-level directory with Maven subprojects; a simple archetype is OK
      − Define data repositories with Avro schemas. Start with a simple file repository with files copied from the Ab Initio file system. No need to spend time reverse engineering; just copy.
      − Add pig and hive directories to cdk-examples
  • 9. POC Simple Extensions
      − Define a webserver in the CDK and create a REST API. Jersey/.../DI if you want more advanced coding styles
      − The webserver graphs performance of Hive/Pig/ETL metrics alongside JVM metrics, and by sending dummy queries in
      − Start Nagios/Ganglia monitoring and Puppet deployment of the CDK as learning for larger scale
      − Integrate the CDK into Bigtop for Capital One distribution practice
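A data repository schema of the kind slide 8 describes could start as small as the sketch below. Only the Avro JSON layout (`type`, `name`, `namespace`, `fields`) is standard; the record name, namespace, and all field names are hypothetical placeholders and do not come from any Ab Initio file.

```python
import json

# Hypothetical Avro record schema for one staged file. Every name below is
# an illustrative assumption, not a field from the actual source system.
STAGED_RECORD_SCHEMA = {
    "type": "record",
    "name": "StagedRecord",
    "namespace": "example.staging",
    "fields": [
        {"name": "record_id", "type": "string"},
        {"name": "event_time_ms", "type": "long"},
        # A nullable field: a union with "null" first and a null default.
        {"name": "amount", "type": ["null", "double"], "default": None},
    ],
}


def write_schema(path, schema=STAGED_RECORD_SCHEMA):
    """Write the schema to an .avsc file (Avro schemas are plain JSON)."""
    with open(path, "w") as handle:
        json.dump(schema, handle, indent=2)


def field_names(schema):
    """Return the field names of a record schema in declaration order."""
    return [field["name"] for field in schema["fields"]]
```

Keeping the schema in a file next to the copied data lets Pig, Hive, and any later tools read the same definition instead of re-deriving it.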
  • 10. Netflix, Block Diagram
  • 11. Simple Netflix Abstractions
      − http://www.slideshare.net/adrianco/netflix-architectu
      − Automated develop-and-deploy software process on APIs. Perforce/Ivy/Jenkins.
      − Hadoop POC: GitHub, Jenkins, deploy to a demo webpage. No code sitting in an Eclipse project
  • 12. Netflix Automated App Dev/Deploy
      − A REST specification makes Web UIs easier. C1 ETL REST I/F
  • 13. Netflix Instance Config
      − Do the same for Capital One as an exercise to help with deployment: with Apache Bigtop, define 1) an NN instance, 2) a DN/RS instance, and customize the scripts per instance
  • 14. Netflix Security
      − Default: turn off iptables/SELinux. Define Capital One POC testing? Start with auditing requirements on the test cluster (with Aravind)
  • 15. Netflix Metrics  Send dummy queries through to measure latency
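The dummy-query idea on slide 15 can be sketched as a small timing harness. This is a minimal sketch under stated assumptions: `run_query` is a hypothetical callable that submits one dummy query and blocks until it returns; nothing here is taken from Netflix tooling.

```python
import statistics
import time


def measure_latency(run_query, samples=20):
    """Call run_query `samples` times; return each call's latency in seconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        run_query()  # hypothetical: submits one dummy query and waits for it
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return count, median, nearest-rank 95th percentile, and max."""
    ordered = sorted(latencies)
    # Nearest-rank percentile: rank = ceil(0.95 * n), done in integer math.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "count": len(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

Reporting the tail (p95, max) alongside the median matters here because a production SLA is usually broken by the slow queries, not the typical one.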
  • 16. Netflix Scaling Layer: do the simpler thing first. JDBC: manage the connection pool; Pig/Hive
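A managed connection pool is the simplest form of the scaling layer mentioned on slide 16: it caps how many clients touch the cluster at once. The sketch below is illustrative only; `connect` is a hypothetical factory supplied by the caller, and no real JDBC, Pig, or Hive API is used.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: at most `size` connections are ever handed out."""

    def __init__(self, connect, size):
        # `connect` is a hypothetical zero-argument factory for a connection.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        """Borrow a connection; raises queue.Empty if none frees up in time."""
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always return the connection, even if the caller's query failed.
            self._idle.put(conn)
```

Callers that cannot get a connection within the timeout fail fast instead of piling more load onto the cluster, which is the behavior a multiuser client layer needs.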
  • 17. Yahoo Block Diagram, Pig, Hive, Spark, Storm
  • 18. Yahoo Spark (cont)
  • 19. Yahoo Next Gen ETL
  • 20. LinkedIn/Yahoo/Netflix References
      − LinkedIn: Muhammed Islam, http://www.slideshare.net/mislam77/hive-at-linkedin
      − Yahoo: Chris Drome, for outside business users. Very similar to the slide before.
      − Netflix: Jeff Magnusson. Hive used for ad hoc queries and lightweight ETL (on web also)
  • 21. ETL - Pig
      − Original Pig paper: http://infolab.stanford.edu/~usriv/papers/pig-latin.pdf
          − ETL language based on relational algebra (reorder/set) vs. SQL queries. Each step is M/R ETL
          − No transactional consistency or indexes (other projects have this)
          − Nested data model vs. flat SQL E/R model. Why? Faster scan performance; replaces joins, e.g. MongoDB
      − Requires developing UDFs. LinkedIn: DataFu
      − Netflix: Lipstick for debugging Pig DAGs. Will need some debugging tool. Better than Spill
  • 22. ETL - Pig
      − M/R ETL points:
          − Data is distributed on several nodes; results are merge-sorted at the end. Be careful sending data across the network: it doesn't scale with more users. Network limitation.
          − Google: custom network switch, 1k+ ports. Custom TCP stack, modified OS
          − Careful: streams scale, so do ETL with streams. Real-time performance. Send results to a separate server. Do not embed writes into stream POCs
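The "data distributed on several nodes, merge sort results at end" point on slide 22 can be illustrated in a few lines. This is a single-process sketch in which each list stands in for one node's partition; the function names are assumptions, and no network transfer (the part the slide warns about) is modeled.

```python
import heapq


def sort_partition(records, key=None):
    """Per-node step: each node sorts only its own partition locally."""
    return sorted(records, key=key)


def merge_sorted_partitions(partitions, key=None):
    """Final step: k-way merge of already-sorted partitions.

    heapq.merge streams the inputs, so the merger never re-sorts the data.
    """
    return list(heapq.merge(*partitions, key=key))


# Three lists stand in for the data held by three nodes.
node_data = [[30, 10], [20, 50], [40]]
merged = merge_sorted_partitions([sort_partition(p) for p in node_data])
```

The merge itself is cheap; the cost the slide points at is shipping every partition to the merging node, which grows with data size and with the number of concurrent users.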
  • 23. Pig vs M/R
  • 24. Pig Usage
      − Yahoo (http://www.linkedin.com/pub/chrisdrome/2/a2/346): thousands of ETL jobs daily; Hive for a small user base external to Yahoo
      − Netflix (http://www.linkedin.com/in/jmagnuss): thousands of jobs, at analyst level. Open sourced Lipstick, a Pig UI debugging tool
      − LinkedIn (http://www.slideshare.net/hadoopusergroup/pig-at-linkedin): thousands of jobs; open sourced DataFu UDFs
  • 25. Pig POCs (~2009)
      − Possible Pig POCs:
          − Top XX queries: manually code up Ab Initio queries. This is already completed 2012? Which queries?
          − Add a JDBC connection type scaling layer to PigServer.java
          − Out of scope for 4/30/14:
              − POC Tez on Pig: https://issues.apache.org/jira/browse/PIG-3446
              − Apache's Pig optimizer (MR->MR->MR goes to MRRR) by writing the optimizer in the YARN AM
  • 26. POC quality  Turn the POCs into Bigtop integration tests and get open source approval. Commit changes to verify quality and accountability
  • 27. Hive 0.11
      − More difficult to configure; add a MySQL metastore
      − Moving to HCatalog so metadata is accessible by other Hadoop components
      − Access using WebHCat, in progress
      − Hive Stinger using Tez: additional in-memory optimization
      − No time spent on this yet; starting 1/2014 with Hortonworks. Last day 4/30/2014
  • 28. Hive 0.11 POCs
      − User guide for Ab Initio programmers using Hive/Pig
      − Test multitenancy features with Pig/HDFS
      − Test JDK 1.7 features. Hadoop 2.x works with 1.7
      − HiveMetastore/MySQL/HCatalog/WebHCat
      − Test cluster performance using benchpress
      − Next gen: 0.12-0.13; Spark/Shark HiveQL compatibility
  • 29. Next Gen ETL Frameworks for 2014+
      − Faster reads/scans without using HBase. 3 developments (WibiData):
          − Spark/Shark
          − Dremel: Impala/Apache Drill
          − Hive/Tez
      − Dremel paper review, "Interactive Analysis of Web-Scale Datasets":
          − Don't use M/R for speed; 100x faster
          − Column schema: nested column-oriented storage, not rows; faster for some types of queries!!!
          − Partition key (not in paper)
  • 30. Next Gen ETL Frameworks for 2014+ (repeats slide 29)
  • 31. Dremel Schema/Column Perf: similar to Kiji without HBase? Sqoop objects
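The reason column-oriented storage is "faster for some types of queries" can be shown with flat records. This is a deliberately simplified sketch: it assumes every row has the same fields and ignores the repetition and definition levels Dremel uses to encode nested data. The field names are made up for illustration.

```python
def rows_to_columns(rows):
    """Pivot a list of flat row dicts into one list per field.

    Assumes every row carries the same fields; nested records are not handled.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical click-log rows.
rows = [
    {"user": "a", "url": "/home", "bytes": 120},
    {"user": "b", "url": "/cart", "bytes": 300},
    {"user": "a", "url": "/pay", "bytes": 80},
]
columns = rows_to_columns(rows)

# An aggregate over one field reads a single column and never touches
# "user" or "url"; a row layout would have to scan every full record.
total_bytes = sum(columns["bytes"])
```

The flip side is that fetching one complete record now means reading from every column, which is why the speedup applies to scans and aggregates rather than to point lookups.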
  • 32. Impala/Drill POC
  • 33. Next Gen ETL
      − Shark/Spark: distributed memory (RDD), analysis and ETL
      − Hive/Tez
  • 34. Next Gen ETL POCs (combine mem)
      − Goal: develop the skill for getting to higher HDFS read performance
      − Stage data: effects of schema/representation on performance. Dremel nested columns:
          − Data with Avro schemas and partition strategies: partition by timestamp, partition by custom rowkey, partition by schema definitions
          − Measure the effect of data schema on M/R and non-M/R implementations
      − Conversion or staging process for data
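Two of the partition strategies named on slide 34 can be sketched as path and bucket functions. The `year=/month=/day=` directory layout and the bucket count are assumptions chosen for illustration, not something the slides specify.

```python
import zlib
from datetime import datetime, timezone


def time_partition_path(base, event_time_ms):
    """Partition by timestamp: map an epoch-millisecond time to a UTC day path."""
    ts = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}"


def rowkey_bucket(rowkey, buckets):
    """Partition by custom rowkey: stable hash of the key into a fixed bucket."""
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(rowkey.encode("utf-8")) % buckets
```

Time partitions let a query for one day skip every other directory, while hashed rowkey buckets spread writes evenly; measuring both against the same queries is the kind of comparison the POC calls for.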
  • 35. Next Gen ETL
      − Addition of new components into Hadoop:
          − CDH will come with Spark/Shark
          − CDH comes with Impala
          − HDP status unknown for now (clear EOM)
  • 36. Hadoop Distributions
      − Create a Capital One distribution
      − Why? Production is 3-4x the amount of work compared to dev
          − Make sure you are ready for production before development is completed
          − Refactor scripts into bin and sbin to give admins and users access to admin/user scripts
          − Customize and add components (scaling layer)
          − Puppet/Chef scripts for cluster deployment
          − Real-time monitoring (not provided in CDH/HDP), hotspot detection for long-running jobs
          − Being ready for cluster deployment allows integration of functional requirements like security into functional Groovy iTests
  • 37. Possible Hadoop Distro POCs
      − Beginner POCs
      − Goal: smooth handoff from dev to production
          − Build Apache Bigtop (will need a reference doc)
          − Add components you are currently using that are not in the distro (e.g. MongoDB + HBase for schema)
          − Add integration tests
          − Add Puppet recipes
          − Learn how to apply patches and how to customize for simple modifications; production stability
  • 38. POC Framework
      − Goal: contribute open source code
      − Start with the documentation and software processes first
          − DocBook
          − Jenkins server: http://apachebigtop.pbworks.com/w/file/49310946/Apache%20Bigtop%20%20Jenkins.docx
  • 39. POC Framework/Roadmap
      − Track the Jiras!!!
          − Multitenancy needs a test plan
          − Development environment using Vagrant instead of EC2: cheaper, easier to administer
          − https://issues.apache.org/jira/browse/BIGTOP-1171
          − https://issues.apache.org/jira/browse/BIGTOP-1136
          − https://issues.apache.org/jira/browse/BIGTOP-1157
      − Create a Capital One Hadoop* user guide
      − Create a functional spec for missing components
      − Include test cases for security, multiuser access, and minimum performance to meet SLAs
  • 40. Scaling
      − Astyanax on Cassandra (Netflix)
          − Small companies don't have 300 users accessing HDFS. Manage the clients. Some examples.
      − Scaling involves multiple components above the cluster hardware and Hadoop daemons
      − This is NOT running CDH or HDP using Ambari or Cloudera Manager
      − Gives SLA and ad hoc high-priority jobs
  • 41. Capital One will need a custom component
      − Either for security or scaling or … even to separate batch analytics queries from ad hoc queries
      − Break down into 2 bigger steps:
          − Cluster testing tool for scaling/security
          − Develop a multiuser client layer using the above, and measure performance and modified use cases
  • 42. Building a scaling layer
      − Need a tool for testing. Need to know how to use ZooKeeper at a minimum.
          − Impossible to figure out via web searches
          − Leader election and countdown latch
          − Most people do their POCs incorrectly
      − Worst mistake: multiple threads on a single server
      − Second worst mistake: using HBase PerformanceEvaluation.java as a reference. PE.java is not cluster aware
      − Test cluster throughput for cluster scaling
  • 43. Analytics
      − Review and demo (weblog targeting)
      − Concepts to agree on first: modeling and targeting
      − http://www.slideshare.net/DougChang1/demographics-andweblogtargeting-10757778
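The countdown latch mentioned on slide 42 has simple semantics that are easiest to see in-process. The sketch below uses threads purely to illustrate those semantics; it is not a valid cluster test, since the slide names multiple threads on one server as the worst mistake. In a real test the latch would be backed by ZooKeeper so that separate hosts start together. The class and function names are assumptions.

```python
import threading


class CountDownLatch:
    """Blocks waiters until count_down() has been called `count` times."""

    def __init__(self, count):
        self._count = count
        self._cond = threading.Condition()

    def count_down(self):
        with self._cond:
            if self._count > 0:
                self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait(self, timeout=None):
        """Return True once the count reaches zero, False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout=timeout)


def run_synchronized(workers, task):
    """Start `workers` threads and release them all at the same moment."""
    ready = CountDownLatch(workers)  # every worker reports in
    start = CountDownLatch(1)        # the coordinator fires the start signal
    results, lock = [], threading.Lock()

    def worker(index):
        ready.count_down()
        start.wait()
        value = task(index)
        with lock:
            results.append(value)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    ready.wait()        # all workers are parked at the start gate
    start.count_down()  # release them together
    for thread in threads:
        thread.join()
    return sorted(results)
```

The two-latch pattern (a ready gate plus a start gate) is what keeps a throughput measurement honest: load begins only after every client is connected, so ramp-up time is not counted as cluster time.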
  • 44. Analytics, (wibidata), schema, model, targeting, use db vs hbase
  • 45. Analytics f(latency). Netflix
  • 46. Analytics
      − Model iteration performance is key. O(n^2) in # users
          − Random Forest: 6-8h on a MacBook
      − Sponsorship from EMC: free 1k-node cluster + GemFire for faster model building
      − Hadoop: HDFS + M/R for certain specific use cases
          − Batch analysis, log analysis. Click log analysis from large disk files
          − ETL, M/R ETL only. Much, much slower than any commercial system
  • 47. Analytics 2014+
      − Visualizations
          − Tableau/Datameer POC? Data+Queries?
      − Deep Learning case studies:
          − Google Now >> Apple Siri. Deep Learning models replaced Gaussian mixture models (GMMs)
      − Background refresher: speech recognition
          − Deep learning as a replacement for GMMs in the acoustic model, http://www.stanford.edu/class/cs224s/2006/
          − Can do POCs here for innovation. Requires outside consultant assistance
  • 48. Deliverables available today
      − Start the Capital One distribution
          − Build instructions
          − Functional Specification: Capital One Hadoop Distro POC
      − Planned, need approval before starting
          − Data Staging
              − Functional Specification: Capital One Data Staging POC
              − Functional Specification: Data Staging API
          − ETL Performance POC
              − Functional Specification: Top 100 queries from Ab Initio
  • 49. Capital One Block Diagram
      − [Diagram labels: REST: Batch ETL M/R; REST: AdHoc M/R; Real Time ETL, No M/R, Streams/Storm; Real Time Analytics; HCatalog/Schema; Scaling Layer; HDFS]
  • 50. POCs
      − Data Ingestion: POC with Apache Kafka; test fixture needed. Current abilities may not be there
      − Hadoop ETL:
          − Schema definition
          − Write/read query performance of the top 10/100 Ab Initio queries. How close is current ETL to Ab Initio? Assume this answer exists.
  • 51. POCs
      − Hadoop Dev->Production: building the Capital One distribution with Apache Bigtop; replicate the CDH configuration with HDFS/Pig/Hive/Oozie/Flume/Spark. Leave out Impala, not currently in Bigtop
      − Scaling: POC intermediate layer