
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift


"No matter the industry, leading organizations need to closely integrate, deploy, secure, and scale diverse technologies to support workloads while containing costs. Nasdaq, Inc.—a leading provider of trading, clearing, and exchange technology—is no exception.

After migrating more than 1,100 tables from a legacy data warehouse into Amazon Redshift, Nasdaq, Inc. is now implementing a fully-integrated, big data architecture that also includes Amazon S3, Amazon EMR, and Presto to securely analyze large historical data sets in a highly regulated environment. Drawing from this experience, Nasdaq, Inc. shares lessons learned and best practices for deploying a highly secure, unified, big data architecture on AWS.

Attendees learn:

Architectural recommendations to extend an Amazon Redshift data warehouse with Amazon EMR and Presto.

Tips to migrate historical data from an on-premises solution and Amazon Redshift to Amazon S3, making it consumable.

Best practices for securing critical data and applications leveraging encryption, SELinux, and VPC."



  1. 1. © 2015 Nasdaq, Inc. All rights reserved. “Nasdaq” and the Nasdaq logo are the trademarks of Nasdaq, Inc. and its affiliates in the U.S. and other countries. “Amazon” and the Amazon Web Services logo are the trademarks of Amazon Web Services, Inc. or its affiliates in the U.S. and other countries. Nate Sammons, Principal Architect, Nasdaq, Inc. October 2015. BDT314: Running a Big Data and Analytics Application on Amazon EMR and Amazon Redshift with a Focus on Security
  2. 2. Nasdaq lists 3,600 global companies representing $9.6 trillion in market cap, spanning diverse industries and many of the world’s most well-known and innovative brands. More than $1 trillion in U.S. notional value is tied to our library of more than 41,000 global indexes. Nasdaq technology is used to power more than 100 marketplaces in 50 countries. Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds. We own and operate 26 markets, 1 clearinghouse, and 5 central securities depositories across asset classes and geographies.
  3. 3. What to Expect from the Session • Motivations for extending an Amazon Redshift warehouse with Amazon EMR • How our data ingest workflow operates • How to query encrypted data in Amazon S3 using Presto and other Hadoop-ecosystem tools • How to manage schemas and data migrations • Future direction for our data warehouse
  4. 4. Current State
  5. 5. Amazon Redshift as Nasdaq’s Main Data Warehouse • Transitioned from an on-premises warehouse to Amazon Redshift • Over 1,000 tables migrated • More data sources added as needed • Nearly two years of data • Average daily ingest of over 7B rows
  6. 6. Never Throw Anything Away • 23-node ds2.8xlarge Amazon Redshift cluster • 828 vCPUs, 5.48 TB of RAM • 368 TB of DB storage capacity, over 1 PB of local disk! • 92 GB/sec aggregate disk I/O • Resize once per quarter • 2.7 trillion rows: 1.8T from sources, 900B derived
  7. 7. Many Data Sources • Internal DBs, CSV files, stream captures, etc. • Data from all 7 exchanges operated by Nasdaq • Orders, quotes, trade executions • Market “tick” data • Security master • Membership • All highly structured and consistent row-oriented data
  8. 8. Data Corollary to the Ideal Gas Law
  9. 9. Motivations for Extending to Amazon EMR and Amazon S3 • Resizing a 300+ TB Amazon Redshift cluster isn’t instantaneous • Continuing to grow the cluster is expensive • Paying for CPU and disk to support infrequently accessed data doesn’t make sense • Data will expand to fill any container
  10. 10. Extending Our Warehouse
  11. 11. Goals • Build a secure, cost-effective, long-term data store • Provide a SQL interface to all data • Support new MPP analytics workloads (Spark, ML, etc.) • Cap the size of our Amazon Redshift cluster • Manage storage and compute resources separately
  12. 12. High Level Overview
  13. 13. Amazon Redshift’s Continuing Role • All data lands in Amazon Redshift first • Amazon Redshift clients have strict SLAs on data availability • Must ensure data loads are finished quickly • Aggregations and transformations performed in SQL • SQL is easy and we have a lot of SQL expertise • Transformed data is then unloaded to Amazon S3 for conversion
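  A minimal sketch of the transform-in-SQL step this slide describes, assuming hypothetical table and column names (trades, daily_trade_summary, quantity) rather than Nasdaq's actual schema:

      -- Aggregate one day of raw rows into a derived table inside Amazon Redshift
      INSERT INTO daily_trade_summary (trade_date, symbol, trade_count, total_qty)
      SELECT trade_date, symbol, COUNT(*), SUM(quantity)
      FROM trades
      WHERE trade_date = '2015-10-08'
      GROUP BY trade_date, symbol;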
  14. 14. Decouple Storage and Compute Resources • Scale each independently as needed; run multiple different apps on top of a common storage system • Especially for old, infrequently accessed data, no need to run compute 24/7 to support it; we can keep data “forever” • Access needs drop off dramatically over time: yesterday >> last month >> last quarter >> last year
  15. 15. Account Structure and Cost Allocations • Separate AWS accounts for each client / department • Departments can run as much or as little compute as they need; use different query tools, experiments • No competition for compute resources across clients • Amazon S3 costs are shared, compute costs are passed through to each department
  16. 16. Data Ingest Workflow
  17. 17. Data Ingest Overview
  18. 18. Nasdaq Workflow Engine • MySQL-backed workflow engine developed in-house • Orchestrates over 40K operations daily • Flexible scheduling and dependency management • Ops GUI for retrying failed steps, root cause analysis • Moving to Amazon Aurora + Amazon EC2 in 2016 • Clustered operation using Amazon S3 as temp storage space
  19. 19. Amazon Redshift Data Ingest Workflow • Data is pulled from various sources • Validate data, convert to CSVs + manifest • Store compressed, encrypted data in Amazon S3 temp space • Load into Amazon Redshift using COPY SQL statements • Further transformation performed using SQL • UNLOAD transformed data back to Amazon S3 • Notifications to other systems using Amazon SQS
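  A hedged sketch of the COPY and UNLOAD steps listed above. The bucket names, IAM role, and table names are placeholders, and the client-side encryption and Amazon SQS notification steps are omitted:

      -- Load validated CSVs from the S3 temp space via a manifest file
      COPY staging_trades
      FROM 's3://ingest-temp-bucket/20151008/trades.manifest'
      IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-ingest'
      MANIFEST GZIP CSV;

      -- After the SQL transformations, unload the results back to S3 for Parquet conversion
      UNLOAD ('SELECT * FROM daily_trade_summary WHERE trade_date = ''2015-10-08''')
      TO 's3://warehouse-export-bucket/daily_trade_summary/20151008/part_'
      IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-ingest'
      GZIP MANIFEST ALLOWOVERWRITE;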
  20. 20. Amazon EMR / Amazon S3 Data Ingest Workflow • Automatically executed after Amazon Redshift loads and transformations complete • Uses Amazon Redshift schema metadata and manifest file to drive conversions to Parquet • Detects schema changes and bumps Hive schema version • Alters schema in Hive Metastore to add new tables, partitions as needed
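  The "add new partitions" step amounts to Hive DDL roughly like the following; the schema, table, version number, and bucket are illustrative, not the production layout:

      -- Register the newly converted Parquet files for one trading day
      ALTER TABLE marketdata.trades
      ADD IF NOT EXISTS PARTITION (`date` = 20151008)
      LOCATION 's3://warehouse-bucket/marketdata/trades/3/date=20151008/';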
  21. 21. Data Security and Encryption
  22. 22. VPC or Nothing • Security is our #1 priority at all times • All instances run in a VPC • Locked-down security groups, network ACLs, etc. • Least-privilege IAM roles for each app and human • See SEC302 – IAM Best Practices from Anders • EC2 instance roles in Amazon EMR • VPC endpoint for Amazon S3 • 10 Gbps private AWS Direct Connect circuits into AWS
  23. 23. Encryption Key Management • On-premises Safenet LUNA HSM cluster for key storage • Amazon Redshift is directly integrated with our HSMs • Nasdaq KMS: • Internally known as “Vinz Clortho” • Roots encryption keys in the HSM cluster • Allows us full control over where keys are stored, used
  24. 24. Transparent Encryption in Amazon S3 and EMRFS Amazon S3 SDK EncryptionMaterialsProvider interface: • Adapter to retrieve keys from our KMS • Used when reading or writing data in Amazon S3 • User metadata to encode encryption key tokens
  25. 25. Encryption Performance with Amazon S3 • Roughly 25% slower than unencrypted • Seek within an encrypted object works: • Critical for performance • Handled automatically • Seeks are relative to the unencrypted size • Create a new HTTP request at an offset within the object • Encryption offset work is handled in the AWS SDK itself • Worst case, we must read two extra blocks of AES data
  26. 26. Local disk encryption with Amazon EMR • Bootstrap action to encrypt ephemeral disks • Specifically to encrypt Presto’s local temp storage • Standard Linux LUKS configuration • Integrated with the Nasdaq KMS • Retrieves key and mounts disks on startup using init.d
  27. 27. SELinux on Amazon EMR • Bootstrap action to install SELinux packages • Adds kernel command line arguments • Rebuilds initrd image • Reboots the node and re-labels the filesystem • Increases cluster boot time • Currently only working on Amazon EMR 3.8 • Working to refine SELinux policy files for Presto
  28. 28. Presto on Amazon EMR
  29. 29. What is Presto? • https://prestodb.io • Open Source MPP SQL database from Facebook • Flexible data sources through Connector API • JDBC, ODBC drivers • Nice GUI from Airbnb: http://nerds.airbnb.com/airpal/ • Hive Connector: • Table schemas defined in a Hive Metastore as external tables • Data files stored in Amazon S3
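  Once the Hive connector is pointed at the Metastore, clients query the S3-backed tables with plain SQL. A hypothetical example (catalog, schema, table, and column names are assumptions):

      -- Presto query against a Hive external table whose Parquet files live in Amazon S3
      SELECT symbol, COUNT(*) AS trade_count
      FROM hive.marketdata.trades
      WHERE "date" = 20151008
      GROUP BY symbol
      ORDER BY trade_count DESC
      LIMIT 10;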
  30. 30. Presto Overview
  31. 31. Running Presto on Amazon EMR • Bootstrap action to download and install Java 8 & Presto • Based on the Amazon EMR team’s Presto BA • Adds support for custom encryption materials provider jars • Configures Presto to use a remote Hive Metastore • Currently using Amazon EMR 3.8, working towards 4.0
  32. 32. Data Encryption in Presto • Presto doesn’t use EMRFS for access to Amazon S3 • We added support for Amazon S3 EncryptionMaterialsProvider to PrestoS3FileSystem.java • Code available at github.com/nasdaq • Working with Facebook to integrate these changes
  33. 33. Data Storage Formats
  34. 34. File Formats: Parquet vs. ORC The two most widely used structured data file formats: • Compressed, columnar record storage • Structured, schema-validated data • Supported by a variety of Hadoop-ecosystem apps • Arbitrary user metadata encoded at the file level
  35. 35. ORC Pros: • DATE and TIMESTAMP type support in Hive, Presto Cons: • Rigid column ordering requirements • Clunky Java API • Unacceptable performance when encrypted in Amazon S3 • 15-18x slower during our testing (!)
  36. 36. The Winner: Parquet • Wide project support: Presto, Spark, Drill, etc. • Actively developed project • Adoption increasing • Columns referenced by name instead of by position • Set hive.parquet.use-column-names=true in Presto config • Good performance when encrypted (~27% slower) • Clean Java API
  37. 37. Parquet Schema Workarounds DATE not supported for Parquet in Hive or Presto • Instead, convert DATEs to INTs • 2015-10-08 becomes 20151008 • Timestamps become a BIGINT (64-bit integer in Hive) • For nanosecond-resolution records, we use a DATE and a separate nanos-since-midnight column
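  In the Redshift SQL that feeds the UNLOAD step, those conversions might look like this; the column names are placeholders and seconds-since-epoch is only one possible BIGINT convention:

      SELECT TO_CHAR(trade_date, 'YYYYMMDD')::INT AS trade_date_int,  -- DATE -> INT, e.g. 20151008
             EXTRACT(EPOCH FROM event_ts)::BIGINT AS event_ts_epoch   -- TIMESTAMP -> BIGINT (seconds since epoch)
      FROM trades;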
  38. 38. Schema and Data Management
  39. 39. Hive Metastore • Amazon EMR 4.0 cluster for the Metastore • Easier for remote access from Presto • Reachable through VPC peering with client accounts • The “source of truth” for Hive schemas • Metastore DB on Amazon RDS for MySQL • Easy backups, encrypted storage • Data ingest system creates/alters tables • Alters tables to add new data partitions each day • Detects newly changed schemas
  40. 40. Managing Versioned, Partitioned Tables in S3 • Store versions of a table in directories in Amazon S3: s3://schema/table/version/date=YYYYMMDD/*.parquet • Works with “msck repair table” commands • When a schema change is detected, increment the version; new data is written to the new location, and alerts are generated for humans to determine changes • Data is migrated in Amazon S3 and old versions are kept for now
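  A sketch of the corresponding external table definition and repair command in Hive DDL; the names, bucket, and abbreviated column list are placeholders:

      -- Version 3 of an external table over the layout shown above;
      -- DATE-typed columns are stored as INTs in the form YYYYMMDD
      CREATE EXTERNAL TABLE IF NOT EXISTS marketdata.trades (
        `trade_date` INT,
        `symbol`     STRING,
        `quantity`   BIGINT
      )
      PARTITIONED BY (`date` INT)
      STORED AS PARQUET
      LOCATION 's3://warehouse-bucket/marketdata/trades/3/';

      -- Discover any existing date=YYYYMMDD directories under that location
      MSCK REPAIR TABLE marketdata.trades;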
  41. 41. Logical vs. Physical Schemas • Track a “logical” and “physical” schema for each table • Logical is compared with Amazon Redshift to detect changes • Physical schema used to produce Hive DDL for Presto • Schema definitions stored in MySQL • Version management and change detection • Amazon S3 location for each table • Tools to export these schemas as .sql files • Hive schema and table create statements • “msck repair table” scripts
  42. 42. File-level Metadata We encode information in file-level metadata: • Partition column definition • Time zone in which the file was parsed • Current & original schema name and version number • Column data type adjustments (DATE -> INT, etc.) Allows us to always recreate logical schema representations from physical files, re-migrate files if a data migration step had a bug, etc.
  43. 43. Table Partitioning and Data Management • Partition Hive tables by date • We have mostly time-series data and are on a daily cadence • Partitioning helps query performance • Use `backticks` when defining column names in SQL • Column names must be lower case in Parquet • Correct bad data in Amazon Redshift through SQL, then UNLOAD partitions for encoding to Parquet • Our tools and automation make it easy to replace modified partitions of data in Hive tables
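  The correct-then-replace flow in the last two bullets might look roughly like this in Redshift SQL; the table, the correction itself, and the bucket are placeholders:

      -- Fix the affected rows for one day inside Amazon Redshift
      UPDATE trades
      SET price = price / 100.0
      WHERE trade_date = '2015-10-08' AND price > 100000;

      -- Re-unload just that day so its Parquet partition can be regenerated and swapped in
      UNLOAD ('SELECT * FROM trades WHERE trade_date = ''2015-10-08''')
      TO 's3://ingest-temp-bucket/reexport/trades/20151008/part_'
      IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-ingest'
      GZIP MANIFEST ALLOWOVERWRITE;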
  44. 44. Working with data in S3 and Amazon Redshift Custom tools developed to make life easier: • Extract CSV data from various DBs, or UNLOAD from Amazon Redshift in whole or in segments • Encode CSVs as Parquet files using a Hive schema • Write data into the correct directory structure in Amazon S3 • Allows us to move data between Amazon Redshift and Amazon S3 easily, and in bulk
  45. 45. Custom Parquet Data Migration Tools • Read records from previous version of a table • Reads from the old location in Amazon S3 • Write records using the current version of a table • Writes to the new location in Amazon S3 • Most migrations are trivial: • Add new column with some default value (or null) • Rename columns • More complicated migrations require Java code • Track original and current version in file metadata
  46. 46. Review & Future Enhancements
  47. 47. Review • Motivations for extending an Amazon Redshift warehouse with Amazon EMR • How our data ingest system operates • How to query encrypted data in Amazon S3 using Presto and other Hadoop-ecosystem tools • How to manage schemas and data migrations
  48. 48. Lessons Learned: TL;DR • Manage storage and compute separately • It’s OK to be paranoid about data loss! • Amazon S3 encryption is easy and seek() works • Parquet vs. ORC • Partition and version your tables • Manage logical and physical table schemas • Data management tools & automation are important
  49. 49. Future Enhancements • Archive original source data for SEC Rule 17a-4 compliance (using Amazon Glacier Vault Lock) • Decouple data retrieval and processing tasks • Move ingest processing to Amazon EC2/Amazon ECS • Move workflow engine DB to Amazon Aurora • Leverage other query frameworks: Spark, ML, etc. • Near real-time streaming ingest • More data sources
  50. 50. Related Sessions
  51. 51. Remember to complete your evaluations!
  52. 52. Thank you!
