Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) - AWS re:Invent 2018

Flexibility is key when building and scaling a data lake. The analytics solutions you use in the future will almost certainly be different from the ones you use today, and choosing the right storage architecture gives you the agility to quickly experiment with and migrate to the latest analytics solutions. In this session, we explore best practices for building a data lake in Amazon S3 and Amazon Glacier that can leverage the entire array of AWS, open source, and third-party analytics tools. We explore use cases for traditional analytics tools, including Amazon EMR and AWS Glue, as well as query-in-place tools like Amazon Athena, Amazon Redshift Spectrum, Amazon S3 Select, and Amazon Glacier Select.

  1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Lake Implementation: Processing and Querying Data in Place (STG204) • John Mallory, Storage Business Development Manager, AWS/BD • Gene Stevens, CTO & Co-Founder, ProtectWise • Joshua Hollander, Principal Software Engineer, ProtectWise
  2. Finding Value in Data is a Journey: Business Monitoring • Business Insights • New Business Opportunity • Business Optimization • Business Transformation. Evolving Tools and Methods: AI/ML • SQL Query
  3. Why Use AWS for Big Data? Agility • Scalability • Get to Insights Faster • Broadest and Deepest Capabilities • Low Cost • Data Migrations Made Easy
  4. Defining the AWS Data Lake. A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous datasets. Key data lake attributes: • Decoupled storage and compute • Rapid ingest and transformation • Secure multi-tenancy • Query in place • Schema on read
  5. Processing and Querying In Place. Fully Managed Process & Query: • Catalog, Transform, & Query Data in Amazon S3 • No physical instances to manage. User-Defined Functions (Lambda Function): • Bring your own functions & code • Execute without provisioning servers
  6. Example of AWS Services for Data Lake: Ingest Methods • Central Storage • Catalog • Processing and Analytics • Access and Secure
  7. Why Amazon S3 for the Data Lake? High Performance • Secure • Durable • Available • Easy to use • Scalable & Affordable • Integrated
  8. Optimize Costs with Data Tiering. Hot to cold: HDFS • Amazon S3 Standard • Amazon S3 - Infrequent Access • Amazon Glacier. • Use EMR/Hadoop with local HDFS for hottest datasets • Store cooler data in Amazon S3 and cold in Amazon Glacier to reduce costs • Use Amazon S3 Analytics to optimize tiering strategy
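The tiering guidance on this slide can be expressed as an S3 lifecycle rule. The following is a minimal sketch, not something shown in the talk: the prefix, the 30-day and 90-day thresholds, and the helper names `build_tiering_rule` and `apply_tiering` are illustrative assumptions. The rule dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts.

```python
def build_tiering_rule(prefix, days_to_ia=30, days_to_glacier=90):
    """Build one lifecycle rule that moves objects under `prefix` to colder tiers.

    The day thresholds are illustrative assumptions, not values from the talk.
    """
    if days_to_glacier <= days_to_ia:
        raise ValueError("Glacier transition must come after the Infrequent Access transition")
    return {
        "ID": "tier-" + prefix.strip("/").replace("/", "-"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
            {"Days": days_to_glacier, "StorageClass": "GLACIER"},
        ],
    }


def apply_tiering(bucket, rules):
    """Send the rules to S3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so that building rules needs no AWS SDK

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )
```

A call such as `apply_tiering("my-data-lake-bucket", [build_tiering_rule("raw/logs/")])` would apply the rule; the bucket name here is hypothetical.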
  9. A Data Lake Needs to Accommodate a Wide Variety of Concurrent Data Sources. Rapidly Ingest All Data Sources: • IoT, Sensor Data, Clickstream Data, Social Media Feeds, Streaming Logs • Oracle, MySQL, MongoDB, DB2, SQL Server, Amazon RDS • On-premises ERP, Mainframes, Lab Equipment, NAS Storage • Offline Sensor Data, NAS, On-premises Hadoop • On-premises data lakes, EDW, Large-Scale Data Collection. Ingest Methods
  10. Ingest Optimization Considerations. Separate ingest buckets from Amazon S3 data lake buckets: • Prepare data before loading into data lake • Preserve raw assets (potentially lifecycle to Amazon Glacier). Aggregate smaller files/objects ahead of Amazon S3 where possible: • Reduces transaction costs • Avoids TPS limits (can also pre-partition S3 bucket). Consider AWS Storage Gateway/AWS Snowball Edge for file data: • Converts files to S3 objects, while preserving metadata. Build Automated Data Pipelines: • AWS Lambda
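The advice to aggregate smaller files ahead of Amazon S3 can be illustrated with a small batching helper. This is a sketch under assumptions: `batch_records` is a hypothetical name, the records are assumed to be newline-terminated byte strings, and the target size is whatever the pipeline chooses. Each yielded batch would become one S3 object instead of many.

```python
def batch_records(records, target_bytes):
    """Group small byte records into batches of at most roughly target_bytes.

    A single record larger than target_bytes is emitted as its own batch.
    """
    batch, size = [], 0
    for record in records:
        # Flush the current batch before it would exceed the target size.
        if batch and size + len(record) > target_bytes:
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(record)
        size += len(record)
    if batch:
        yield b"".join(batch)
```

Fewer, larger objects mean fewer PUT and GET requests for the same data, which is the transaction-cost and TPS point the slide makes.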
  11. Amazon Kinesis: Real Time. Easily collect, process, and analyze video and data streams in real time. • Kinesis Video Streams: Capture, process, and store video streams for analytics • Kinesis Data Streams: Build custom applications that analyze data streams • Kinesis Data Firehose: Load data streams into AWS data stores • Kinesis Data Analytics: Analyze data streams with SQL
  12. Common AWS Data Pipeline Configuration. Highly decoupled configurations scale better, are more fault tolerant, and are cost optimized. Raw Data: Amazon S3 • ETL (Hadoop): Amazon EMR • Triggered Code: AWS Lambda • Staged Data (Data Lake): Amazon S3 • ETL & Catalog Management: AWS Glue • Data Warehouse: Amazon Redshift • Triggered Code: AWS Lambda
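The "Triggered Code" stages in this pipeline are typically Lambda functions invoked by S3 event notifications. A minimal sketch of such a handler follows. The event layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`, with URL-encoded keys) is the standard S3 notification shape; the return value and what happens next (starting an ETL job, for example) are assumptions for illustration only.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Collect the S3 objects named in an S3 event notification."""
    new_objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        new_objects.append("s3://{}/{}".format(bucket, key))
    # Hypothetical next step: hand these paths to an ETL job or a queue.
    return {"new_objects": new_objects}
```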
  13. Process Data in Place… Amazon S3 • Amazon Athena • Amazon Redshift Spectrum • Amazon SageMaker • AWS Glue
  14. Amazon S3 Select and Amazon Glacier Select: Select a subset of data from an object based on a SQL expression
  15. Motivation Behind Amazon S3 Select. GET all the data from S3 objects, and my application will filter the data that I need. Redshift Spectrum example: • Customer: Run 50,000 queries • Amount of data fetched from S3: 6 PB • Amount of data used in Amazon Redshift: 650 TB • Data needed from S3: 10%
  16. Amazon S3 Select. Input Format: delimited text (CSV, TSV), JSON, Parquet … • Compression: GZIP, BZIP2 … • Output Format: delimited text (CSV, TSV), JSON … • Clauses: Select, From, Where • Data types: String; Integer, Float, Decimal; Timestamp; Boolean • Operators: Conditional, Math, Logical, String (Like, ||) • Functions: String, Cast, Math, Aggregate
  17. Amazon S3 Select: Serverless MapReduce. 2X Faster at 1/5 of the cost.
      Before (200 seconds and 11.2 cents):
          # Download and process all keys
          for key in src_keys:
              response = s3_client.get_object(Bucket=src_bucket, Key=key)
              contents = response['Body'].read()
              for line in contents.split('\n')[:-1]:
                  line_count += 1
                  try:
                      data = line.split(',')
                      srcIp = data[0][:8]
                      ….
      After (95 seconds and 2.8 cents):
          # Select IP Address and Keys
          for key in src_keys:
              response = s3_client.select_object_content(Bucket=src_bucket, Key=key, expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object as obj")
              contents = response['Body'].read()
              for line in contents:
                  line_count += 1
                  try:
                      ….
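The "After" snippet on this slide is abbreviated. In boto3, `select_object_content` returns its results as an event stream under `response["Payload"]` rather than a readable `Body`. The sketch below shows one way to consume that stream. It assumes headerless CSV input and CSV output; `select_rows` is a hypothetical helper name, and the SQL expression is passed in by the caller rather than hard-coded.

```python
def select_rows(s3_client, bucket, key, expression):
    """Run an S3 Select query on one CSV object and return rows as lists of strings."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
        OutputSerialization={"CSV": {}},
    )
    chunks = []
    for event in response["Payload"]:
        # Only Records events carry data; Stats and End events are skipped.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # A row can be split across two Records events, so join before parsing.
    text = b"".join(chunks).decode("utf-8")
    return [line.split(",") for line in text.splitlines() if line]
```

With a real client this would be called as `select_rows(boto3.client("s3"), src_bucket, key, "SELECT obj._1, obj._2 FROM s3object AS obj")`. The naive `split(",")` assumes fields contain no quoted commas.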
  18. Amazon S3 Select: Accelerating Big Data. Before vs. After: 5X Faster with 1/40 of the CPU • EMR 5.18.0
  19. Choosing the Right Data Formats. There is no such thing as the “best” data format • All involve tradeoffs, depending on workload & tools • CSV, TSV, JSON are easy, but not efficient • Compress & store/archive as raw input • Columnar compressed formats are generally preferred • Parquet or ORC • Smaller storage footprint = lower cost • More efficient scan & query • Row oriented (Avro) good for full data scans. Key considerations are cost, performance & support
  20. Choosing the Right Data Formats (cont.). Pay by the amount of data scanned per query. Use Compressed Columnar Formats: • Parquet • ORC. Easy to integrate with wide variety of tools.
      Dataset | Size on Amazon S3 | Query Run time | Data Scanned | Cost
      Logs stored as Text files | 1 TB | 237 seconds | 1.15 TB | $5.75
      Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
      Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
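The savings row can be re-derived from the two data rows. The sketch below assumes a price of $5 per TB scanned, which is inferred from the table itself ($5.75 for 1.15 TB) rather than stated on the slide, and assumes 1 TB = 1024 GB. Note that the run times as transcribed (237 s and 5.13 s) work out to roughly 46x rather than 34x; the sketch simply reports the computed ratio.

```python
PRICE_PER_TB_SCANNED = 5.00  # USD; assumption inferred from $5.75 / 1.15 TB
GB_PER_TB = 1024             # assumption: binary terabytes

text_logs = {"size_gb": 1 * GB_PER_TB, "runtime_s": 237.0, "scanned_gb": 1.15 * GB_PER_TB}
parquet_logs = {"size_gb": 130.0, "runtime_s": 5.13, "scanned_gb": 2.69}


def query_cost(scanned_gb):
    """Cost in USD of one query that scans scanned_gb of data."""
    return scanned_gb / GB_PER_TB * PRICE_PER_TB_SCANNED


def summarize(before, after):
    """Compare two datasets: storage saved, speedup, scan reduction, and cost."""
    return {
        "storage_saved": 1 - after["size_gb"] / before["size_gb"],
        "speedup": before["runtime_s"] / after["runtime_s"],
        "scan_saved": 1 - after["scanned_gb"] / before["scanned_gb"],
        "cost_before": query_cost(before["scanned_gb"]),
        "cost_after": query_cost(after["scanned_gb"]),
    }


summary = summarize(text_logs, parquet_logs)
for name, value in summary.items():
    print("{}: {:.4f}".format(name, value))
```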
  21. Data Prep is ~80% of Data Lake Work: Building training sets • Cleaning and organizing data • Collecting data sets • Mining data for patterns • Refining algorithms • Other
  22. AWS Glue: Serverless Data Catalog & ETL. Data Catalog: Discover data and extract schema • Automatically discovers data and stores schema • Data searchable, and available for ETL. ETL Job authoring: Auto-generates customizable ETL code in Python and Spark • Generates customizable code • Schedules and runs your ETL jobs • Serverless
  23. Amazon Athena: Interactive Analysis. Interactive query service to analyze data in Amazon S3 using standard SQL • No infrastructure to set up or manage and no data to load • Supports Multiple Data Formats: Define Schema on Demand. Query Instantly • Pay per query • Open • Easy
  24. Amazon Redshift Spectrum: Extend the data warehouse to exabytes of data in S3 data lake. S3 data lake • Amazon Redshift data • Redshift Spectrum query engine. • Exabyte Amazon Redshift SQL queries against S3 • Join data across Amazon Redshift and Amazon S3 • Scale compute and storage separately • Stable query performance and unlimited concurrency • CSV, ORC, Grok, Avro, & Parquet data formats • Pay only for the amount of data scanned
  25. Amazon SageMaker: The quickest and easiest way to get ML models from idea to production. End-to-End Machine Learning Platform • Zero setup • Flexible Model Training • Pay by the second
  26. Top Data Lake Use Case Scenarios: Data Warehouse Modernization • Protect Your On-premises Data Lake • Real Time Analytics • Business Intelligence & Data Exploration • AI/Machine Learning
  27. Case Study: ProtectWise • Gene Stevens, CTO & Co-Founder, ProtectWise • Joshua Hollander, Principal Software Engineer, ProtectWise
  28. ProtectWise Overview ● Cloud-Delivered Network Detection & Response (NDR) platform ● >500 terabytes of data analyzed/day ● >10M transactions/second ● Petabytes at rest
  29. Why AWS. Core strategy: ● Time to market ● COGS innovation as foundational ● Evolutionary architecture ● Continuous learning ● Human time as most critical ● AWS roadmap ahead of industry. Innovation as first principle. Or, how to be as good at what you don’t know as what you do know
  30. Why AWS ● Rate of innovation is king ● Easily introduce new technologies ● Ability to surgically target optimizations ● Must be ahead of feature and volume growth ● Transactional costs must continuously trend down. Building your own infrastructure puts core strategy at risk. COGS innovation is foundational
  32. What is Explorer? Interactively search entire network timeline: What IPs did 172.20.243.225 talk to using BitTorrent in the last 30 days? Hunt for threats: Did device X make any HTTP requests with a User Agent associated with known malware? Visualize
  33. Explorer
  34. Explorer 1.0. Before Explorer: ● Used Apache Cassandra ● Used for just KV lookups. Explorer required: ● Searching ● Sorting ● Faceting
  35. Choosing DataStax Enterprise Edition ● Already using Cassandra ● Needed searchability ● Not all of this data is immutable ● Time to market prioritized over cost
  36. Challenges with initial solution ● Operational ○ Performance tuning ○ Scaling ○ Data retention ● Cost ○ Licensing ○ Instances ○ Storage ○ Operational
  38. Possibilities ● Cassandra + Elasticsearch ● NoSQL solution “X” ● Hive ● Presto. Want the cost and scale profile of S3-based warehouse solutions but with similar query performance to DSE.
  39. Hybrid custom + off-the-shelf
  40. Recipe ● Inspired by Hive, Presto et al ● Leverage existing tools. Combine with secret ingredient...
  41. Secret ingredient: Probabilistic meta indexes. Bloom filters!!! ● Hive already has Bloom filter support via ORC on block level ● Our customization: the metastore is a Bloom filter!
  42. A “searchable” Bloom filter
      Indexing:
      Field | Value | Indexed Values | Doc ID
      ip | 192.168.0.1 | {0, 3, 5} | 0
      ip | 10.0.0.1 | {1, 2, 4} | 1
      ip | 8.8.8.8 | {0, 1, 4} | 2
      Index:
      Term | Doc IDs
      0 | 0,2
      1 | 1,2
      2 | 1
      3 | 0
      4 | 1,2
      5 | 0
      Queries:
      Field | Query String | Actual Query
      ip | ip:192.168.0.1 | ip_bits:0 AND 3 AND 5
      ip | ip:10.0.0.1 | ip_bits:1 AND 2 AND 4
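The idea on this slide is that each value is hashed to a few bit positions, each bit position becomes an index term pointing at the documents that set it, and a lookup is the AND of the postings for the value's bits. The following is a minimal sketch of that idea and not ProtectWise's implementation: the class name `BloomMetaIndex`, the SHA-256 based `hashed_bits`, and the 64-bit width are all assumptions. The usage at the end plugs in the bit positions from the slide's indexing table instead of real hashes.

```python
import hashlib
from collections import defaultdict


def hashed_bits(value, num_bits=64, num_hashes=3):
    """Derive up to num_hashes bit positions for value (stand-in hash scheme)."""
    positions = set()
    for seed in range(num_hashes):
        digest = hashlib.sha256("{}:{}".format(seed, value).encode()).digest()
        positions.add(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


class BloomMetaIndex:
    """Inverted index whose terms are Bloom filter bit positions."""

    def __init__(self, bits_for=hashed_bits):
        self.bits_for = bits_for
        self.postings = defaultdict(set)  # bit position -> doc IDs that set it

    def add(self, doc_id, value):
        for bit in self.bits_for(value):
            self.postings[bit].add(doc_id)

    def candidates(self, value):
        """Docs that may contain value; false positives are possible, misses are not."""
        result = None
        for bit in self.bits_for(value):
            docs = self.postings.get(bit, set())
            result = set(docs) if result is None else result & docs
        return result or set()


# Bit positions copied from the slide's indexing table.
SLIDE_BITS = {
    "192.168.0.1": {0, 3, 5},
    "10.0.0.1": {1, 2, 4},
    "8.8.8.8": {0, 1, 4},
}
index = BloomMetaIndex(bits_for=SLIDE_BITS.__getitem__)
for doc_id, ip in enumerate(SLIDE_BITS):
    index.add(doc_id, ip)
```

Because different values can share bit positions, a lookup can return documents that do not actually contain the value, so the candidates still have to be verified against the data itself.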
  43. Query system • REST API • Leverage Spark APIs: PrunedFilteredScan. 1. Spark SQL analyzes queries 2. Meta Index is queried with Spark Pushdown values a. Apply partition filters to any Spark Pushdown filters b. Apply bloom filters to any Spark Pushdown filters 3. Hand off resulting file list to Spark: FileScanRDD
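Steps 2a, 2b, and 3 amount to pruning a file list twice before any data is read. The sketch below mirrors that flow in plain Python. It is not Spark code and does not use the `PrunedFilteredScan` interface; `FileEntry`, `prune_files`, and the example paths are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    path: str
    partition: dict                       # e.g. {"day": "2018-11-01"}
    bloom_bits: set = field(default_factory=set)  # bits set by values in this file


def prune_files(files, partition_filters, value_bits):
    """Return the paths that survive partition pruning, then Bloom filter pruning."""
    survivors = []
    for entry in files:
        # Step 2a: drop files whose partition values do not match the pushed-down filters.
        if any(entry.partition.get(k) != v for k, v in partition_filters.items()):
            continue
        # Step 2b: a file can contain the value only if every bit for it is set.
        if not value_bits <= entry.bloom_bits:
            continue
        survivors.append(entry.path)
    # Step 3: this list is what would be handed to the scan.
    return survivors
```

Only the surviving files are then read, which is how the amount of data pulled from S3 is kept small.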
  44. ● Hive/Presto like query system ● Custom metastore with probabilistic meta indexes
  45. Outcome ● 95% Cost reduction over previous Explorer system ● Minimizes the amount of data pulled from S3 to satisfy queries ● Performance: ○ p95 Query times in seconds ○ p95 Key Value lookup times are sub-second! ● Scale out is nearly zero effort ○ No storage scale out required ○ Query servers auto-scale according to query load
  47. Amazon S3. Cheap. Scalable: • Separate storage from compute • Storage scales with zero effort • Scale compute independently. Reliable. Fast.
  48. Spark "On-Demand" • Amazon EMR. Scaling: • Inter-cluster • Intra-cluster • Easy auto scaling. Used for: • Ingest • Query
  49. The future ● Incorporating S3 Select ○ Decrease query time and cost even more ○ Effort currently underway ● Leveraging Kinesis Data Firehose ○ Simplify ingest pipeline ○ Reduce operational complexity ● Go “Serverless” ○ Meta store ○ Query compute ○ Ingest
  50. Lessons learned • Features first, economics later • A hybrid of build and buy can be effective • Open source ecosystem is rich • AWS tools are flexible building blocks • AWS roadmap leads on innovation. High-speed, probabilistic data lakes sourced in Amazon S3 are foundational.
  51. Thank you! Gene Stevens @genestevens • Joshua Hollander @jholla14 • ProtectWise.com/TestDrive