
Architecting Data Lake on AWS by the Data Engineering Team at HiFX IT

This is the presentation we shared at the AWS Summit 2017 in Bangalore. It showcases our high-performance framework and the components that enable an organisation to be data driven. Find out how these components are engineered to scale, store data securely and process it for insights.

  1. Architecting Data Lakes on AWS with HiFX: Established in 2001, HiFX is an Amazon Web Services Consulting Partner that has been designing and migrating applications and workloads in the cloud since 2010. We have been helping organisations become truly data driven by building data lakes on AWS since 2015.
  2. The Challenges
     • Lack of agility and accessibility for data analysis that would help the product team make smart business decisions and improve strategies.
     • Increasing volume and velocity of data. With new digital properties being added, the collection and storage layers needed to be designed to scale well.
     • Dozens of independently managed collections of data, leading to data silos. With no single source of truth, it was difficult to identify what data was available, get access to it and integrate it.
     • Poorly recorded data. Often, the meaning and granularity of the data was lost in processing.
  3. Our Journey from Data to Decisions with an AWS-powered Data Lake
     • Connecting dozens of data streams and repositories to a unified data pipeline, enabling near-real-time access to any data source.
     • Engineering well designed big data stores for reporting and exploratory analysis.
     • Architecting a secure, well governed data lake to store all data in raw format. S3 is the fabric with which we have woven the solution.
     • Processing data in streams or batches to aid analytics and machine learning, supplemented by smart workflow management to orchestrate the tasks.
     • Dynamic dashboards and visualisations that make data tell stories and help drive insights, offering recommendations and predictive analytics on the data in the data lake.
  4. COLLECT STORE PROCESS CONSUME
     • Scribe (Collector): Scribe collects data from the trackers and writes it to Kinesis Streams. It is written in Go and engineered for high concurrency, low latency and horizontal scalability. Currently running on two c4.large instances, our API latency is 12.6 ms at the 50th percentile and 36 ms at the 75th percentile, made possible by the consistent and predictable performance of Kinesis.
     • Accumulo (Storage): Accumulo is the data consumer component responsible for reading data from the event streams (Kinesis Streams), performing rudimentary data quality checks and converting data to Avro format before loading it into the Data Lake. Our Data Lake in S3 captures and stores raw data at scale for a low cost. It allows us to store many types of data in the same repository while letting us define the structure of the data at the time it is used.
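As a rough illustration of the collection step, the sketch below (in Scala, using the AWS SDK for Java) shows the kind of put-record call a collector like Scribe makes against Kinesis Streams. The production collector is written in Go; the stream name, partition-key choice and object names here are illustrative assumptions rather than HiFX's actual code.

```scala
import java.nio.ByteBuffer
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.PutRecordRequest

object EventWriter {
  // Hypothetical stream name; the real collector (Scribe) is written in Go.
  private val streamName = "clickstream-events"
  private val kinesis = AmazonKinesisClientBuilder.defaultClient()

  /** Write a single raw event to Kinesis, partitioned by session id. */
  def putEvent(sessionId: String, payload: Array[Byte]): Unit = {
    val request = new PutRecordRequest()
      .withStreamName(streamName)
      .withPartitionKey(sessionId)        // keeps a session's events on one shard, in order
      .withData(ByteBuffer.wrap(payload))
    kinesis.putRecord(request)
  }
}
```

Partitioning by a session or user identifier is one common way to preserve per-session ordering while still spreading load across shards.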
  5. COLLECT STORE PROCESS CONSUME
  6. Why Amazon S3 for the Data Lake?
     • Unlimited number of objects and volume of data, with 99.99% availability and 99.999999999% durability. Lower TCO and easier to scale than HDFS.
     • Decoupled storage and compute, allowing multiple, heterogeneous analysis clusters to use the same data.
     • Native support for versioning and tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies; secure with SSL and client/server-side encryption.
     • Performance is relatively lower than an HDFS cluster, but this doesn't affect our workloads significantly. EMRFS with consistent view (backed by DynamoDB) works really well.
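Decoupled storage and compute is easiest to see in code: any Spark cluster can read the same raw objects from S3 and write curated output back, independently of the cluster that produced them. The minimal sketch below assumes placeholder bucket names, prefixes and a partition column, and assumes the spark-avro package is available on the cluster.

```scala
import org.apache.spark.sql.SparkSession

object LakeReadWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-data-lake-example")
      .getOrCreate()

    // Hypothetical bucket/prefixes: raw events land as Avro, curated data is written as Parquet.
    val rawPath     = "s3://example-data-lake/raw/events/"
    val curatedPath = "s3://example-data-lake/curated/events/"

    // Any EMR cluster can read the same raw data; compute is decoupled from storage.
    val events = spark.read.format("avro").load(rawPath)

    // Partitioning by date keeps later scans cheap and maps cleanly onto lifecycle policies.
    events.write
      .partitionBy("event_date")
      .mode("append")
      .parquet(curatedPath)

    spark.stop()
  }
}
```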
  7. COLLECT STORE PROCESS CONSUME
     • Prism (Processor): a unified processing engine using Apache Spark running on EMR, written in Scala. Airflow is used to programmatically author, schedule and monitor workflows. Prism generates data for tracking KPIs and performs funnel, pathflow, retention and affinity analysis. It also includes machine learning workloads that generate recommendations and predictions.
     • Lens (Consumer): a custom-built reporting and visualisation app that helps business owners easily interpret, visualise and record data and derive insights. Detailed analysis of KPIs, event segmentation, funnels, search insights, path finder, retention/addiction analysis etc., powered by Redshift and Druid, with Pgpool used to cache Redshift queries.
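A Prism-style batch job might look roughly like the sketch below: a Spark/Scala application that takes a run date (as a scheduler such as Airflow would pass it), reads one day of curated events from the lake and writes a small KPI table back to S3. The paths, column names and event schema are assumptions for illustration, not the production model.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DailyKpiJob {
  def main(args: Array[String]): Unit = {
    val runDate = args(0) // e.g. "2017-05-01", typically supplied by the scheduler (Airflow)
    val spark = SparkSession.builder().appName(s"daily-kpis-$runDate").getOrCreate()

    // Read one day of curated events from the lake (assumed schema and path).
    val events = spark.read.parquet("s3://example-data-lake/curated/events/")
      .where(col("event_date") === runDate)

    // Derive a simple per-event-type KPI table for the consumption layer.
    val kpis = events.groupBy("event_type")
      .agg(count(lit(1)).as("events"), countDistinct("user_id").as("unique_users"))

    kpis.write.mode("overwrite")
      .parquet(s"s3://example-data-lake/marts/daily_kpis/event_date=$runDate/")

    spark.stop()
  }
}
```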
  8. COLLECT STORE PROCESS CONSUME
  9. KPIs
     • Product relationship: understand which products are viewed consecutively.
     • Product affinity: understand which products are purchased together.
     • Sales: hourly, daily, weekly, monthly, quarterly and annual.
     • Average market basket: average order size.
     • Cart abandonment rate: shopping cart abandonment rate.
     • Days/visits to purchase: the average number of days and sessions from the first website interaction to purchase.
     • Cost per acquisition: (total cost of marketing activities) / (number of conversions).
     • Repeat purchase rate: what percentage of our customers are repeat customers.
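Two of these KPIs reduce to simple formulas: cart abandonment rate can be taken as 1 - (purchasing sessions / sessions with an add-to-cart), and cost per acquisition is total marketing spend divided by conversions, as stated above. The sketch below works them through with Spark over an assumed events table; the schema, event names and spend figure are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object KpiExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kpi-examples").getOrCreate()
    val events = spark.read.parquet("s3://example-data-lake/curated/events/")

    // Cart abandonment rate = 1 - (sessions with a purchase / sessions with an add-to-cart).
    val cartSessions     = events.where(col("event_type") === "add_to_cart")
                                 .select("session_id").distinct().count()
    val purchaseSessions = events.where(col("event_type") === "purchase")
                                 .select("session_id").distinct().count()
    val abandonmentRate  = 1.0 - purchaseSessions.toDouble / cartSessions

    // Cost per acquisition = total cost of marketing activities / number of conversions.
    val marketingSpend = 12000.0 // assumed figure, purely for illustration
    val cpa = marketingSpend / purchaseSessions

    println(f"Cart abandonment rate: ${abandonmentRate * 100}%.1f%%, CPA: ${cpa}%.2f")
    spark.stop()
  }
}
```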
  10. Product page performance: measuring product performance. The scatter plot compares the number of unique users that view each product with the number of unique users that add the product to basket, with the size of each dot representing the number of unique users that buy the product. Any products located in the lower right corner are highly trafficked but low converting; any effort spent fixing those product pages (e.g. by checking the copy, updating the product images or lowering the price) should be rewarded with a significant sales uplift, given the number of people visiting those pages.
  11. Product page performance (continued): In contrast, products located in the top left of the plot are very highly converting but low-trafficked pages. We should drive more traffic to these pages, either by positioning those products more prominently on catalogue pages, for example, or by spending marketing dollars driving more traffic to those pages specifically. Again, that investment should result in a significant uplift in sales, given how highly converting those products are. Similarly, products in the lower left corner are performing poorly, but it is not clear whether this is because they have low traffic levels and/or are poor at driving conversions. We should invest in improving the performance of these pages, but the return on that investment is likely to be smaller (or harder to achieve) than the other two opportunities.
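The table behind this scatter plot can be derived with a single aggregation: for each product, count distinct viewers, distinct add-to-basket users and distinct buyers. The sketch below assumes a simple (user_id, product_id, event_type) event schema, which is an illustration rather than the production model.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ProductPagePerformance {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("product-page-performance").getOrCreate()
    val events = spark.read.parquet("s3://example-data-lake/curated/events/")

    // One row per product: x = unique viewers, y = unique adders, dot size = unique buyers.
    val perProduct = events.groupBy("product_id").agg(
      countDistinct(when(col("event_type") === "product_view", col("user_id"))).as("unique_viewers"),
      countDistinct(when(col("event_type") === "add_to_cart",  col("user_id"))).as("unique_adders"),
      countDistinct(when(col("event_type") === "purchase",     col("user_id"))).as("unique_buyers")
    )

    perProduct.write.mode("overwrite")
      .parquet("s3://example-data-lake/marts/product_page_performance/")
    spark.stop()
  }
}
```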
  12. Identifying products / content that go well together. Market basket analysis is an association rule learning technique aimed at uncovering the associations and connections between specific products in our store. In a market basket analysis, we look to see if there are combinations of products that frequently co-occur in transactions. We can use this type of analysis to:
     • Inform the placement of content items on sites, or products in the catalogue
     • Drive recommendation engines (like Amazon's "customers who bought this product also bought these products…")
     • Deliver targeted marketing (e.g. emailing customers who bought specific products with offers on other products that are likely to be interesting to them)
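Spark ships an FP-Growth implementation that covers this kind of market basket analysis. The sketch below groups an assumed (order_id, product_id) purchases table into baskets and mines association rules; the path, thresholds and column names are placeholders.

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MarketBasket {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("market-basket").getOrCreate()

    val purchases = spark.read.parquet("s3://example-data-lake/curated/purchases/")

    // One row per order, with the set of products bought together.
    val baskets = purchases.groupBy("order_id")
      .agg(collect_set("product_id").as("items"))

    val model = new FPGrowth()
      .setItemsCol("items")
      .setMinSupport(0.01)    // item sets appearing in at least 1% of orders
      .setMinConfidence(0.2)  // keep rules that hold at least 20% of the time
      .fit(baskets)

    // Association rules of the form "customers who bought X also bought Y".
    model.associationRules.show(20, truncate = false)
    spark.stop()
  }
}
```

The associationRules output (antecedent, consequent, confidence) is the raw material for the placement, recommendation and targeted-marketing uses listed above.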
