Apache Kudu - Updatable Analytical Storage #rakutentech

https://rakutentechnologyconference2017.sched.com/speaker/shoshimauchi

  1. © Cloudera, Inc. All rights reserved. Apache Kudu: Updatable Analytical Storage for Modern Data Platform. Sho Shimauchi | Sales Engineer | Cloudera
  2. Who Am I? Sho Shimauchi, Sales Engineer / Technical Evangelist. Joined Cloudera in 2011 as the first employee in Cloudera APJ. Email: sho@cloudera.com. Twitter: @shiumachi
  3. Cloudera: founded in 2008; 1600+ Clouderans; machine learning and analytics platform; shared data experience; cloud-native and cloud-differentiated; open-source innovation and efficiency.
  4. Rakuten Card + Cloudera: Rakuten Card replaced its mainframe with Cloudera Enterprise in 2017, and Apache Spark improved the performance of its batch processes by more than 2x. Please join Cloudera World Tokyo 2017 to see Kobayashi-san’s keynote! www.clouderaworldtokyo.com
  5. Why Kudu? Use Cases and Motivation
  6. Cloudera Enterprise: the modern platform for machine learning and analytics, optimized for the cloud. Diagram labels: Extensible Services, Core Services, Data Engineering, Operational Database, Analytic Database, Data Science, Data Catalog, Ingest & Replication, Security, Governance, Workload Management, New Offerings. Storage services: Amazon S3, Microsoft ADLS, HDFS, Kudu.
  7. Filling the Analytic Gap. HDFS: fast scans, analytics, and processing of stored data. HBase: fast online updates and data serving. Kudu fills the gap: fast analytics on fast-changing or frequently updated data. Modern analytic applications often require complex data flows and difficult integration work to move data between HBase and HDFS. Diagram axes: Pace of Data, Pace of Analysis; other labels: Unchanging, Append-Only, Fast Changing, Frequent Updates, Real-Time, Arbitrary Storage (Active Archive).
  8. Apache Kudu: scalable and fast structured storage. Scalable: tested up to 300+ nodes (PB-scale cluster); designed to scale to thousands of nodes and tens of PBs. Fast: multiple GB/second read throughput per node; millions of read/write operations per second across the cluster. Tabular: represents data in structured tables like a relational database; strict schema, finite column count, no BLOBs; individual record-level access to 100+ billion row tables.
  9. Apache Kudu Community
  10. Why Kudu? Time series data, machine data analytics, online reporting. Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes? Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning? How fast can you add data to your data store? Are you trading off the ability to do broad analytics for the ability to make updates? Are you retaining only part of your data?
  11. Next-generation hardware. Solid-state storage: cheaper and faster every year, with persistent memory (3D XPoint™) arriving; Kudu can take advantage of SSD and NVM using Intel’s NVM Library. Cheaper, bigger memory: RAM is cheaper and bigger every day; Kudu runs smoothly with huge RAM and is written in C++ to avoid GC issues. Efficiency on modern CPUs: modern CPUs are adding cores and SIMD width, not GHz; Kudu takes advantage of SIMD instructions and concurrent data structures.
  12. How it Works: Replication and Fault Tolerance
  13. Tables, tablets, and tablet servers. Each table is horizontally partitioned into tablets using range or hash partitioning, e.g. PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS. Each tablet has N replicas (3 or 5) kept consistent with Raft consensus, giving automatic fault tolerance. MTTR (mean time to repair): ~5 seconds.
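Hash partitioning as on this slide can be sketched in a few lines of Python. This is a simplified illustration, not Kudu's implementation: `zlib.crc32` is a stand-in for Kudu's own hash function, `hash_bucket` and `partition` are hypothetical helper names, and each bucket is treated as one tablet (true only when a table uses hash partitioning alone).

```python
import zlib

NUM_BUCKETS = 100  # matches "INTO 100 BUCKETS" on the slide


def hash_bucket(timestamp: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a timestamp to a bucket number in [0, num_buckets).

    zlib.crc32 is a stand-in chosen for illustration; Kudu uses its own hash.
    """
    key_bytes = timestamp.to_bytes(8, "big", signed=True)
    return zlib.crc32(key_bytes) % num_buckets


def partition(rows, num_buckets: int = NUM_BUCKETS):
    """Group (host, metric, timestamp) rows by the hash bucket of their timestamp."""
    tablets = {}
    for host, metric, ts in rows:
        bucket = hash_bucket(ts, num_buckets)
        tablets.setdefault(bucket, []).append((host, metric, ts))
    return tablets


# Consecutive timestamps are spread across buckets instead of piling into one tablet.
rows = [("web01", "cpu", 1442865000 + i) for i in range(10_000)]
tablets = partition(rows)
print(f"{len(tablets)} non-empty tablets")
```

Hashing on `timestamp` spreads a steady stream of new time-series writes across all tablets; the trade-off is that a scan over a time range has to visit every bucket.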
  14. Metadata. A replicated master acts as a tablet directory, as a catalog (which tables exist, etc.), and as a load balancer (it tracks tablet server liveness and re-replicates under-replicated tablets). It caches all metadata in RAM for high performance. The client is configured with the master addresses, asks the master for tablet locations as needed, and caches them.
  15. Client: “Hey Master! Where is the row for ‘tlipcon’ in table “T”?” Master: “It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …” The client then sends UPDATE tlipcon SET col=foo to the tablet server. Client meta cache: T1: … T2: … T3: …
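The client-side lookup on these two slides can be sketched as a small cache keyed by tablet start key. Everything here is hypothetical: `FakeMaster`, `MetaCache`, `TabletLocation`, and the string key ranges are illustrative stand-ins, not the Kudu client API; the real client also handles hash partitions, leader changes, and cache invalidation.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class TabletLocation:
    tablet_id: str
    start_key: str   # inclusive
    end_key: str     # exclusive; "" means unbounded
    replicas: tuple  # tablet servers holding a replica


class FakeMaster:
    """Hypothetical master that answers location requests and counts RPCs."""

    def __init__(self, tablets):
        self.tablets = list(tablets)
        self.calls = 0

    def locate(self, key):
        self.calls += 1
        # Like the slide's "BTW": reply with info about other tablets too.
        return self.tablets


class MetaCache:
    def __init__(self, master):
        self.master = master
        self.starts = []   # sorted tablet start keys
        self.entries = []  # TabletLocation objects, parallel to self.starts

    def _find(self, key):
        i = bisect.bisect_right(self.starts, key) - 1
        if i >= 0:
            loc = self.entries[i]
            if loc.end_key == "" or key < loc.end_key:
                return loc
        return None

    def lookup(self, key):
        loc = self._find(key)
        if loc is None:  # cache miss: ask the master once, cache everything it returns
            for t in self.master.locate(key):
                i = bisect.bisect_left(self.starts, t.start_key)
                if i == len(self.starts) or self.starts[i] != t.start_key:
                    self.starts.insert(i, t.start_key)
                    self.entries.insert(i, t)
            loc = self._find(key)
        return loc


master = FakeMaster([
    TabletLocation("T1", "", "m", ("A", "B", "C")),
    TabletLocation("T2", "m", "u", ("Z", "Y", "X")),
    TabletLocation("T3", "u", "", ("D", "E", "F")),
])
cache = MetaCache(master)
print(cache.lookup("tlipcon"))  # the T2 entry, on servers Z, Y, X
cache.lookup("alice")           # already cached: no second master RPC
```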
  16. Raft consensus. Tablet server A holds Tablet 1 (LEADER); TS B and TS C hold Tablet 1 (FOLLOWER); each has its own WAL. 1a. Client -> Leader: Write() RPC. 2a. Leader -> Followers: UpdateConsensus() RPC. 2b. Leader writes its local WAL. 3. Each follower writes its WAL. 4. Follower -> Leader: success. 5. Leader has achieved a majority. 6. Leader -> Client: Success!
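The commit rule behind steps 1a to 6 can be sketched as follows. This is a deliberately simplified, sequential model with hypothetical `Replica` and `replicate_write` names; real Raft sends UpdateConsensus() RPCs in parallel, replies as soon as a majority acknowledges, and tracks terms and log indices, none of which is modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    wal: list = field(default_factory=list)
    alive: bool = True

    def append_wal(self, entry) -> bool:
        """Write the entry to this replica's WAL; a dead replica cannot ack."""
        if not self.alive:
            return False
        self.wal.append(entry)
        return True


def majority(n_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1


def replicate_write(leader: Replica, followers: list, entry) -> bool:
    """Leader-side write: succeed once a majority (leader included) has the entry in its WAL."""
    acks = 1 if leader.append_wal(entry) else 0   # 2b. leader writes local WAL
    for f in followers:                           # 2a. UpdateConsensus() to each follower
        if f.append_wal(entry):                   # 3. follower writes WAL
            acks += 1                             # 4. follower -> leader: success
    return acks >= majority(1 + len(followers))   # 5./6. majority reached -> success


leader = Replica("TS A")
followers = [Replica("TS B"), Replica("TS C", alive=False)]
print(replicate_write(leader, followers, "row1"))  # True: 2 of 3 replicas wrote the WAL
```

This is why a 3-replica tablet tolerates one failed server and a 5-replica tablet tolerates two.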
  17. How it Works: Columnar Storage
  18. Row storage: scans have to read all the data, and there are no per-column encodings. Columns: tweet_id, user_name, created_at, text. Rows: {23059873, newsycbot, 1442865158, Visual exp…}, {22309487, RideImpala, 1442828307, Introducing …}, …
  19. Columnar storage: each column is stored separately. tweet_id: {25059873, 22309487, 23059861, 23010982}; user_name: {newsycbot, RideImpala, fastly, llvmorg}; created_at: {1442865158, 1442828307, 1442865156, 1442865155}; text: {Visual exp…, Introducing …, Missing July…, LLVM 3.7…}
  20. Columnar storage: SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot'; only has to read 1 column, user_name (2GB), instead of every column: tweet_id (1GB), user_name (2GB), created_at (1GB), and text (200GB).
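The difference between the two layouts can be sketched by counting how many stored values each one touches for the slide's query. The four rows are the slide's sample tweets with their text truncated as on the slide; the helper names are hypothetical, and real column stores also skip data using encodings and statistics. With the slide's column sizes, a row-oriented scan reads roughly 1 + 2 + 1 + 200 = 204GB, while the columnar scan reads only the 2GB user_name column.

```python
COLUMNS = ("tweet_id", "user_name", "created_at", "text")
ROWS = [
    (25059873, "newsycbot", 1442865158, "Visual exp..."),
    (22309487, "RideImpala", 1442828307, "Introducing ..."),
    (23059861, "fastly", 1442865156, "Missing July..."),
    (23010982, "llvmorg", 1442865155, "LLVM 3.7..."),
]


def to_columns(rows):
    """Pivot row tuples into one list per column (a columnar layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def count_user_row_store(rows, user):
    """COUNT(*) WHERE user_name = user over whole rows. Returns (count, values_read)."""
    count = values_read = 0
    for row in rows:
        values_read += len(row)  # every field of the row is read
        count += row[1] == user
    return count, values_read


def count_user_column_store(columns, user):
    """Same query, but only the user_name column is read."""
    col = columns["user_name"]
    return sum(v == user for v in col), len(col)


print(count_user_row_store(ROWS, "newsycbot"))                 # (1, 16)
print(count_user_column_store(to_columns(ROWS), "newsycbot"))  # (1, 4)
```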
  21. Columnar compression. created_at: {1442825158, 1442826100, 1442827994, 1442828527}. Storing diff(created_at) instead of the raw values: 1442825158 → n/a, 1442826100 → 942, 1442827994 → 1894, 1442828527 → 533, so each value needs 11 bits instead of 64 bits. Many columns can compress to a few bits per row, especially timestamps, time series values, and low-cardinality strings. Massive space savings and throughput increase!
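The delta-plus-bit-packing arithmetic on this slide can be checked with a short sketch. It assumes sorted input, so every delta is non-negative (negative deltas would need an extra step such as zigzag encoding), and it illustrates the idea rather than Kudu's actual column encodings; the function names are hypothetical.

```python
def delta_encode(values):
    """Keep the first value as-is, then store each value's difference from its predecessor."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded):
    """Rebuild the original values by running-summing the deltas."""
    out = []
    for i, v in enumerate(encoded):
        out.append(v if i == 0 else out[-1] + v)
    return out


def bits_needed(deltas):
    """Bit width needed to bit-pack every (non-negative) delta."""
    return max(d.bit_length() for d in deltas)


created_at = [1442825158, 1442826100, 1442827994, 1442828527]
encoded = delta_encode(created_at)   # [1442825158, 942, 1894, 533]
width = bits_needed(encoded[1:])     # 11, since 1024 <= 1894 < 2048
print(f"64 bits -> {width} bits per value after the first")
```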
  22. How it Works: Write and Read Paths
  23. LSM vs. Kudu. LSM (Log-Structured Merge: Cassandra, HBase, etc.): inserts and updates all go to an in-memory map (MemStore) and are later flushed to on-disk files (SSTable, HFile); reads perform an on-the-fly merge of all on-disk HFiles. Kudu: shares some traits (memstores, compactions) but is more complex, trading slower writes for faster reads (especially scans).
  24. LSM insert path: INSERT Row=r1 col=c1 val=“blah”, Row=r1 col=c2 val=“1” is written to the MemStore, which is then flushed to HFile 1 (Row=r1 col=c1 val=“blah”, Row=r1 col=c2 val=“1”).
  25. LSM insert path: INSERT Row=r2 col=c1 val=“blah2”, Row=r2 col=c2 val=“2” is written to the MemStore, which is then flushed to a new file, HFile 2 (Row=r2 col=c1 val=“blah2”, Row=r2 col=c2 val=“2”). HFile 1 still holds Row=r1 col=c1 val=“blah”, Row=r1 col=c2 val=“1”.
  26. LSM update path: UPDATE Row=r2 col=c1 val=“newval” is written to the MemStore, while the older values stay in HFile 1 (Row=r1 col=c1 val=“blah”, Row=r1 col=c2 val=“2”) and HFile 2 (Row=r2 col=c1 val=“v2”, Row=r2 col=c2 val=“5”). Note: all updates are “fully decoupled” from reads. A random-write workload is transformed into a fully sequential one!
  27. LSM read path: the MemStore holds Row=r2 col=c1 val=“newval”; HFile 1 holds Row=r1 col=c1 val=“blah”, Row=r1 col=c2 val=“2”; HFile 2 holds Row=r2 col=c1 val=“v2”, Row=r2 col=c2 val=“5”. Reads merge based on string row keys: R1: c1=blah c2=2; R2: c1=newval c2=5; …. This is CPU intensive: reads must always read the row keys, and because any given row may exist across multiple HFiles, they must always merge. The more HFiles there are to merge, the slower reads become.
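The merge on this slide can be sketched with Python's `heapq.merge`, treating the MemStore and each HFile as a sorted run of `((row, col), value)` entries. This is a hypothetical model (`lsm_read` is an illustrative name): real LSM stores also handle deletes (tombstones), timestamps, and block indexes. It does show why reads slow down as HFiles accumulate, since every run is merged and row keys are compared along the way.

```python
import heapq


def lsm_read(memstore, hfiles):
    """Merge the MemStore and HFiles; the newest source wins for each (row, col).

    hfiles is ordered oldest first, as they were flushed.
    """
    sources = [memstore] + list(reversed(hfiles))  # age 0 = newest
    # Each run is sorted by key; the age tag makes the newest version come first per key.
    runs = [[(key, age, val) for key, val in sorted(src.items())]
            for age, src in enumerate(sources)]
    merged = {}
    for key, age, val in heapq.merge(*runs):
        merged.setdefault(key, val)  # keep only the first (newest) version
    rows = {}
    for (row, col), val in merged.items():
        rows.setdefault(row, {})[col] = val
    return rows


hfile1 = {("r1", "c1"): "blah", ("r1", "c2"): "2"}
hfile2 = {("r2", "c1"): "v2", ("r2", "c2"): "5"}
memstore = {("r2", "c1"): "newval"}
print(lsm_read(memstore, [hfile1, hfile2]))
```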
  28. Kudu storage: inserts and flushes. INSERT(“todd”, “$1000”, “engineer”) goes into the MemRowSet, which is flushed to DiskRowSet 1 with columns name, pay, and role. The base data is stored in multiple files, one per column, and holds the latest version of the data.
  29. Kudu storage: inserts and flushes. The next INSERT(“doug”, “$1B”, “Hadoop man”) goes into the MemRowSet and is flushed to DiskRowSet 2, which has its own name, pay, and role base data alongside DiskRowSet 1.
  30. Kudu storage: updates. DiskRowSet 1 and DiskRowSet 2 (name, pay, role base data) each have a DeltaMemStore. In memory: the MemRowSet and the DeltaMemStores. On disk: the base data.
  31. Kudu storage: updates. UPDATE set pay=“$1M” WHERE name=“todd”. Is the row in DiskRowSet 2? Check its bloom filter: the bloom filter says no! Is the row in DiskRowSet 1? Check its bloom filter: the bloom filter says maybe! Search the key column to find the offset: rowid = 150. The change is recorded in DiskRowSet 1's DeltaMemStore as 150: col 1=$1M.
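The update path on this slide can be sketched with a toy bloom filter and a sorted key column. `BloomFilter`, `DiskRowSet`, and `update` are hypothetical, simplified stand-ins (SHA-256-derived bit positions, a Python int as the bit array), not Kudu's classes. The point is the order of operations: skip rowsets whose bloom filter says no, binary-search the key column when it says maybe, and record the change in that rowset's DeltaMemStore instead of rewriting base data.

```python
import bisect
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        """False means definitely absent; True means maybe present."""
        return all((self.bits >> p) & 1 for p in self._positions(key))


class DiskRowSet:
    """Hypothetical DiskRowSet: sorted key column, bloom filter, DeltaMemStore."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)
        self.delta_memstore = {}  # rowid -> {column: new value}

    def find_rowid(self, key):
        i = bisect.bisect_left(self.keys, key)  # binary search in the key column
        return i if i < len(self.keys) and self.keys[i] == key else None


def update(rowsets, key, column, value):
    for drs in rowsets:
        if not drs.bloom.might_contain(key):
            continue                 # bloom says "no": skip this rowset entirely
        rowid = drs.find_rowid(key)  # bloom says "maybe": search the key column
        if rowid is not None:
            drs.delta_memstore.setdefault(rowid, {})[column] = value
            return drs, rowid
    raise KeyError(key)


drs1 = DiskRowSet([f"a{i:03d}" for i in range(150)] + ["todd"])
drs2 = DiskRowSet(["doug"])
target, rowid = update([drs2, drs1], "todd", "pay", "$1M")
print(rowid, target.delta_memstore[rowid])  # 150 {'pay': '$1M'}
```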
  32. Kudu storage: delta flushes. A DeltaMemStore (e.g., holding 0: pay=foo) is flushed to a REDO DeltaFile. A REDO delta indicates how to transform the ‘base data’ (columnar) into a later version.
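How REDO deltas turn base data into a later version can be sketched as follows. The delta tuple layout `(timestamp, rowid, column, value)`, the timestamps, and the `apply_redo` name are hypothetical; Kudu stores deltas in its own on-disk format. An optional snapshot timestamp stops applying deltas newer than the snapshot.

```python
def apply_redo(base, redo_deltas, snapshot_ts=None):
    """Apply REDO deltas in timestamp order to a copy of the base data.

    base: {rowid: {column: value}}; redo_deltas: [(ts, rowid, column, value), ...]
    """
    rows = {rowid: dict(cols) for rowid, cols in base.items()}  # base stays untouched
    for ts, rowid, column, value in sorted(redo_deltas, key=lambda d: d[0]):
        if snapshot_ts is not None and ts > snapshot_ts:
            break  # deltas are sorted, so everything after this is newer too
        rows[rowid][column] = value
    return rows


base = {0: {"name": "todd", "pay": "$1000", "role": "engineer"}}
redo = [(5, 0, "pay", "foo"), (9, 0, "pay", "$1M")]
print(apply_redo(base, redo)[0]["pay"])                 # $1M
print(apply_redo(base, redo, snapshot_ts=6)[0]["pay"])  # foo
```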
  33. Kudu storage: minor delta compaction. Before compaction, the DiskRowSet has its base data (name, pay, role), a DeltaMemStore (Delta MS), and four REDO DeltaFiles; a minor delta compaction merges several REDO DeltaFiles into one without rewriting the base data.
  34. Kudu storage: major delta compaction. REDO deltas are merged into the base data of the DiskRowSet. Compaction can be performed on only the frequently updated columns (here, pay), so some REDO DeltaFiles may remain unmerged. The compaction produces UNDO records, which store previous versions of the data.
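A major delta compaction on a subset of columns can be sketched like this: REDOs for the chosen columns are folded into the base, each overwritten value becomes an UNDO record, and REDOs for other columns stay unmerged. The tuple layouts and the `major_delta_compaction` name are hypothetical illustrations, not Kudu's format.

```python
def major_delta_compaction(base, redo_deltas, columns):
    """Fold REDO deltas for `columns` into a new base; return (new_base, undo, unmerged).

    REDO and UNDO records are (ts, rowid, column, value); an UNDO's value is the
    one that the change at ts overwrote.
    """
    new_base = {rowid: dict(cols) for rowid, cols in base.items()}
    undo, unmerged = [], []
    for ts, rowid, column, value in sorted(redo_deltas, key=lambda d: d[0]):
        if column not in columns:
            unmerged.append((ts, rowid, column, value))  # stays a REDO delta
            continue
        undo.append((ts, rowid, column, new_base[rowid][column]))  # previous value
        new_base[rowid][column] = value
    return new_base, undo, unmerged


base = {0: {"name": "todd", "pay": "$1000", "role": "engineer"}}
redo = [(5, 0, "pay", "foo"), (7, 0, "role", "manager"), (9, 0, "pay", "$1M")]
new_base, undo, unmerged = major_delta_compaction(base, redo, columns={"pay"})
print(new_base[0], undo, unmerged)
```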
  35. Kudu storage: RowSet compactions. Before: DRS 1 (32MB) [PK=alice], [PK=iris], [PK=linda], [PK=zach]; DRS 2 (32MB) [PK=bob], [PK=jon], [PK=mary], [PK=zeke]; DRS 3 (32MB) [PK=carl], [PK=julie], [PK=omar], [PK=zoe]. Each covers the range A-Z, so writes for “chris” have to perform bloom lookups on all 3 RowSets. Compaction reorganizes rows to avoid RowSets with overlapping key ranges. After: DRS 4 (32MB) [alice, bob, carl, iris], range A-I; DRS 5 (32MB) [jon, julie, linda, mary], range J-M; DRS 6 (32MB) [omar, zach, zeke, zoe], range O-Z. Now “chris” falls in the range of DRS 4 only.
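The effect of this compaction on lookups can be sketched with the slide's keys. The helper names are hypothetical, `rows_per_output` stands in for the 32MB output size, and a real implementation would index rowset key ranges (for example with an interval tree) instead of scanning a dict.

```python
def rowsets_to_check(rowsets, key):
    """Names of rowsets whose [min, max] key range covers key; each needs a bloom lookup."""
    return [name for name, keys in rowsets.items() if min(keys) <= key <= max(keys)]


def compact(rowsets, rows_per_output, first_id):
    """Merge all rows in key order and cut them into non-overlapping rowsets."""
    merged = sorted(k for keys in rowsets.values() for k in keys)
    return {
        f"DRS {first_id + n}": merged[start:start + rows_per_output]
        for n, start in enumerate(range(0, len(merged), rows_per_output))
    }


before = {
    "DRS 1": ["alice", "iris", "linda", "zach"],
    "DRS 2": ["bob", "jon", "mary", "zeke"],
    "DRS 3": ["carl", "julie", "omar", "zoe"],
}
after = compact(before, rows_per_output=4, first_id=4)
print(rowsets_to_check(before, "chris"))  # ['DRS 1', 'DRS 2', 'DRS 3']
print(rowsets_to_check(after, "chris"))   # ['DRS 4']
```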
  36. Kudu storage: compactions. Main idea: always be compacting! Compactions run continuously to prevent I/O storms. “Budgeted” RowSet compactions: what is the best way to spend X MB of I/O? Physical/logical decoupling: different replicas run compactions at different times.
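One natural way to frame the slide's question, "what is the best way to spend X MB of I/O?", is as a 0/1 knapsack problem: each candidate rowset costs its size in MB and offers some benefit. This sketch is an assumption, not a description of Kudu's compaction policy; the candidate names and benefit scores are made up, and sizes must be positive integers.

```python
def best_compaction(candidates, budget_mb):
    """0/1 knapsack over (name, size_mb, benefit); returns (total_benefit, chosen_names)."""
    # best[b] = best (benefit, names) achievable using at most b MB of I/O
    best = [(0.0, ())] * (budget_mb + 1)
    for name, size, benefit in candidates:
        # Iterate budgets downward so each candidate is picked at most once.
        for b in range(budget_mb, size - 1, -1):
            cand = best[b - size][0] + benefit
            if cand > best[b][0]:
                best[b] = (cand, best[b - size][1] + (name,))
    return best[budget_mb]


candidates = [("rs_a", 32, 5.0), ("rs_b", 32, 4.0), ("rs_c", 64, 8.0)]
print(best_compaction(candidates, 64))  # (9.0, ('rs_a', 'rs_b'))
```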
  37. Kudu storage: read path. The MemRowSet, DiskRowSet 1 and DiskRowSet 2 (name, pay, role base data), and their DeltaMemStores, one of which holds 150: pay=$1M. A read applies any pending deltas to the base data of the DiskRowSet that holds the row: “Just need to read this DiskRowSet!”
  38. Kudu storage: time-travel read. The DiskRowSet has base data (name, pay, role), a DeltaMemStore (Delta MS), REDO DeltaFiles, and UNDO records for pay. T=0: a query starts reading “pay” in another DiskRowSet. T=10: a major delta compaction runs on this DiskRowSet, updating the base file and creating UNDO records. T=20: the query starts reading “pay” in this DiskRowSet, but reads the T=0 version from the UNDO records.
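The time-travel read can be sketched by rolling compacted base data back with UNDO records. The record layout `(ts, rowid, column, old_value)`, the update at a hypothetical ts=5, and the `read_at` name are assumptions for illustration: base data already includes every change, and each UNDO newer than the snapshot is undone, newest first.

```python
def read_at(base, undo_records, snapshot_ts):
    """Reconstruct rows as of snapshot_ts from compacted base data plus UNDO records.

    An UNDO (ts, rowid, column, old_value) means the change made at ts overwrote old_value.
    """
    rows = {rowid: dict(cols) for rowid, cols in base.items()}
    for ts, rowid, column, old in sorted(undo_records, key=lambda u: u[0], reverse=True):
        if ts <= snapshot_ts:
            break  # remaining UNDOs are older than the snapshot: keep those changes
        rows[rowid][column] = old
    return rows


# After the T=10 compaction: base holds the new pay, UNDO remembers the old one.
base = {0: {"name": "todd", "pay": "$1M", "role": "engineer"}}
undo = [(5, 0, "pay", "$1000")]
print(read_at(base, undo, snapshot_ts=0)[0]["pay"])   # $1000 (the query's T=0 view)
print(read_at(base, undo, snapshot_ts=20)[0]["pay"])  # $1M
```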
  39. Takeaways
  40. Getting Started. On the web: https://www.cloudera.com/documentation/kudu/latest.html, https://www.cloudera.com/downloads.html, https://blog.cloudera.com/?s=Kudu, kudu.apache.org. Apache project user mailing list: user@kudu.apache.org. Quickstart VM: the easiest way to get started, with Impala and Kudu in an easy-to-install VM. CSD and Parcels: for installation on a Cloudera Manager-managed cluster. Training classes available: https://www.cloudera.com/more/training.html
  41. Cloudera World Tokyo 2017: Tue, Nov 7, 2017, ANA Intercontinental Hotel (estimated attendees: 1000). Session E-1: Apache Kudu on Analytical Data Platform. Register now! www.clouderaworldtokyo.com
  42. Thank you! sho@cloudera.com
