---------------------------------------------------------------
Outline
---------------------------------------------------------------
About Myself
Role of Analytics
About Abebooks
Innovation and Data
Abebooks DW modernization
ETL Tool Selection for the Cloud
AWS Analytics Solution Pricing Example
Some Free Learning resources
---------------------------------------------------------------
Disclaimer
---------------------------------------------------------------
The content of this presentation does not represent Abebooks, Amazon, or AWS. This
information is based on my knowledge and experience and does not contain any sensitive or
confidential data. All information and pictures are available online.
About Myself
• Working with Business Intelligence since 2007
• In Canada since 09/2015
#dimaworkplace
Technical Skills Matrix
(Timeline chart, 2007–2019: Data Warehouse, ETL/ELT, Business Intelligence, Big Data, Cloud Analytics (AWS, Azure, GCP), Machine Learning)
Other Activities
• Author of “Jumpstart Snowflake: A Step-by-Step Guide to Modern Cloud Analytics”
• BI Tech Talk (100+ BI teams globally)
• Amazon Tableau User Group (2000+ users)
• Conferences (EDW 2018, 2019, Data Summit, SQL PASS)
• Amazon internal conferences
---------------------------------------------------------------
Outline
---------------------------------------------------------------
Role of Analytics
Business Value
Stakeholders Employees Customers
Value
”The goal of any organization is to generate Value”
The Future of Competition.
https://www.amazon.com/Future-Competition-Co-Creating-Unique-Customers/dp/1578519535
BI Value Chain
Stakeholders Employees Customers
Value
Decisions
Data
Value creation is based on effective decisions.
Effective decisions are based on accurate information.
---------------------------------------------------------------
Outline
---------------------------------------------------------------
About Abebooks
About Abebooks
• Online marketplace for books, art & collectibles
• An Amazon subsidiary since 2008; a marketplace for used books and, increasingly, non-book collectibles
• 350M listings
• 3 people in the Data Engineering Team, out of 120 employees
• 2 locations: Victoria, BC and Düsseldorf
---------------------------------------------------------------
Outline
---------------------------------------------------------------
Innovation and Data
History of Innovation
Industrial Revolution → 2nd Industrial Revolution → Digital Revolution → ??
We are here
https://www.weforum.org/about/the-fourth-industrial-revolution-by-klaus-schwab
Industrial Revolution → 2nd Industrial Revolution → Digital Revolution → 4th Industrial Revolution
AWS Rapid Pace of Innovation
Chart: number of AWS services over time
https://www.slideshare.net/AmazonWebServices/a-culture-of-innovation-powered-by-aws
---------------------------------------------------------------
Outline
---------------------------------------------------------------
Analytics powered by AWS
For Data to be a differentiator, customers need to be able to…
• Capture and store new non-relational data at PB–EB scale in real time
• Discover value in new types of analytics that go beyond batch reporting to incorporate real-time, predictive, voice, and image recognition
• Democratize access to data in a secure and governed way
New types of analytics
Dashboards, Predictive, Real-time, Voice, Image Recognition
New types of data
Data & analytics partners extend the traditional
approach
Data Warehouse
Business Intelligence
OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social
Big Data processing, real-time, Machine Learning
Data Lake
• Relational and non-relational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
---------------------------------------------------------------
Outline
---------------------------------------------------------------
DW Modernization
BI/DW (before)
Source Layer: Files, Inventory, Sales (via SFTP)
Storage Layer: Data Warehouse, ETL (PL/SQL)
Access Layer: Ad-hoc SQL
BI/DW Survey
Cloud Migration Strategy
Lift & Shift
• Typical Approach
• Move all-at-once
• Target platform then evolve
• Approach gets you to the cloud quickly
• Relatively small barrier to learning new technology
since it tends to be a close fit
Split & Flip
• Split application into logical functional data layers
• Match the data functionality with the right
technology
• Leverage the wide selection of tools on AWS to
best fit the need
• Move data in phases — prototype, learn and
perfect
Choosing an ETL Tool for the Cloud
Use Cases
• OLTP to Redshift
• SFTP/API to Redshift
• Data Transformation
• Dimensional Modelling
• AWS Integration
• Big Data
Tools
• Informatica
• AWS Glue
• Talend
• Fivetran
• Alooma
• Stitch
• Matillion
ETL Criteria
High:
• Security
• Support Redshift
• CDC
• Ease of Use for
BI/DW
• Cover use cases
• On-Premise and
full control
Medium:
• Support NoSQL
• Deployment/Architecture
• Encryption
• Ease of Use for non BI/DW
• Data Transformations
• Management
• Pricing
• Performance
Low:
• Version Control
• Linux OS
• ETL Monitoring
• Logging
• R/Python
Why We Picked Matillion
• Was built for Redshift and Cloud
• Speed of ELT operations
• Speed of development
• Wide range of data sources supported
• Ease of use outside of DE/DBA expertise
• Native with AWS
• $$$
After two years of use, it has lived up to our expectations!
Matillion ETL
DW Modernization (after)
DW Modernization (extending)
---------------------------------------------------------------
Outline
---------------------------------------------------------------
Use Cases and Challenges
OLTP to Dimensional Modelling
Problem: Heavy transformations and lots of dependencies. Users like to consume a classical star
schema. Lots of tables with CDC.
Solution: Using Matillion, we implemented a CDC pattern and reused it across all tables,
visualizing all jobs and dependencies. Using built-in components, we easily created the
Dimensional Model.
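The CDC pattern reused across tables can be sketched as the classic Redshift staging-table merge: load changed rows into a staging table, delete matching keys from the target, then insert the fresh rows. A minimal sketch, with illustrative table and key names (not from the actual jobs):

```python
# Sketch of the per-table CDC upsert: Redshift (pre-MERGE) applies staged
# changes with a delete+insert inside one transaction. Table/key names are
# placeholders for illustration.

def build_cdc_merge_sql(target: str, staging: str, key: str) -> str:
    """Return the delete+insert statements that apply staged changes to the target."""
    return (
        f"BEGIN;\n"
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key};\n"
        f"INSERT INTO {target} SELECT * FROM {staging};\n"
        f"TRUNCATE {staging};\n"
        f"END;"
    )

print(build_cdc_merge_sql("dim_customer", "stg_customer", "customer_id"))
```

In Matillion this delete+insert step becomes a shared job parameterized by table and key, which is what made it cheap to roll out across all CDC tables.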
Self-Service BI
Problem: Business users want an interactive, self-service tool: fast time to market and
less dependency on IT.
Solution: We chose Tableau, a leader in BI that is widely adopted across Amazon.
Integration with BI
Problem: Having the best BI tool doesn't guarantee a good SLA.
Solution: Built trigger-based integration between Matillion ETL and Tableau, and added data
quality checks.
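One way such a trigger can work is the final ETL step calling Tableau Server's REST API to run the dependent extract refresh, so dashboards only show fully loaded, quality-checked data. A hedged sketch (the endpoint shape follows Tableau's documented "Run Extract Refresh Task" call; the server URL, site id, task id, and token below are placeholders, not values from the actual setup):

```python
# Hypothetical sketch: after the last ETL step succeeds, POST to Tableau
# Server's "run extract refresh task now" endpoint. All ids are placeholders.
import urllib.request

def refresh_request(server: str, api_version: str, site_id: str,
                    task_id: str, token: str) -> urllib.request.Request:
    """Build the POST request that runs a Tableau extract refresh task."""
    url = (f"{server}/api/{api_version}/sites/{site_id}"
           f"/tasks/extractRefreshes/{task_id}/runNow")
    return urllib.request.Request(
        url,
        data=b"<tsRequest />",  # the endpoint expects an empty tsRequest body
        headers={"X-Tableau-Auth": token, "Content-Type": "application/xml"},
        method="POST",
    )

req = refresh_request("https://tableau.example.com", "3.4",
                      "SITE-ID", "TASK-ID", "AUTH-TOKEN")
print(req.full_url)
```

Sending the request (via `urllib.request.urlopen(req)`) requires a real server and a sign-in token obtained from the REST API's `auth/signin` call.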
Lack of Notification
Problem: Users miss emails, or the emails land in spam.
Solution: Leverage a messenger with webhooks (Slack, Chime, and so on).
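A minimal sketch of such an alert, assuming a Slack incoming webhook (the job name, message wording, and webhook URL are illustrative; Chime webhooks work the same way with a different payload key):

```python
# Sketch: format an ETL status message as a Slack incoming-webhook payload
# instead of sending email. The webhook URL is a placeholder.
import json
import urllib.request

def build_alert(job: str, status: str, detail: str) -> bytes:
    """Format an ETL status message as Slack webhook JSON."""
    text = f"ETL job `{job}` finished with status *{status}*: {detail}"
    return json.dumps({"text": text}).encode("utf-8")

payload = build_alert("load_orders", "FAILED", "row count check did not pass")
print(payload.decode("utf-8"))

# Actually sending it requires a real webhook URL:
# urllib.request.urlopen(urllib.request.Request(
#     "https://hooks.slack.com/services/XXX/YYY/ZZZ", data=payload,
#     headers={"Content-Type": "application/json"}))
```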
Lack of Logging
Problem: We didn't have any detailed logs about our ETL performance, and we didn't have
any insights.
Solution: Matillion provides its own logging and audit engine. In addition, we are able to
collect logs at any level of ETL jobs and transformations.
Marketing Automation
Problem: The Marketing team wants to “Move Fast and Break Things”.
Solution: Using Matillion, we gave Marketing template jobs, and they now run their jobs
themselves using the built-in marketing data connectors.
DW Slowdown
Problem: After some time, the Redshift DW started hitting concurrency and performance issues.
Solution: Scale the Redshift cluster based on current needs (a couple of minutes), implement
automation for Vacuum and Compression, and add WLM.
Amazon Redshift Utils: https://github.com/awslabs/amazon-redshift-utils
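The maintenance automation can be sketched as generating `VACUUM` and `ANALYZE` statements per managed table; the awslabs amazon-redshift-utils repo linked above provides a production-grade version of this idea. Table names below are illustrative:

```python
# Illustrative sketch of nightly Redshift maintenance: emit VACUUM + ANALYZE
# statements for each managed table (to be executed over a DB connection).

def maintenance_statements(tables):
    """Yield VACUUM + ANALYZE statements for the given schema-qualified tables."""
    for t in tables:
        yield f"VACUUM DELETE ONLY {t};"   # reclaim space from deleted rows
        yield f"ANALYZE {t};"              # refresh planner statistics

stmts = list(maintenance_statements(["public.fact_sales", "public.dim_book"]))
for s in stmts:
    print(s)
```

`VACUUM DELETE ONLY` is the cheaper variant when rows are mostly appended in sort order; a full `VACUUM` also re-sorts.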
NoSQL (DynamoDB) to DW
Problem: Our main inventory database moved to NoSQL (DynamoDB), and getting incremental
changes into Redshift with the default functionality is challenging and costly.
Solution: Using a mix of AWS tools (Kinesis Firehose and AWS Glue) together with Matillion,
we are able to capture changes every hour.
Inventory changes → DynamoDB → Kinesis Firehose (stores changes) → Glue converts to Parquet → Redshift DW
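One step in that pipeline is flattening DynamoDB's typed attribute-value JSON (as it lands in S3 via Firehose) into plain columns before Glue writes Parquet. A hypothetical sketch of that conversion, with illustrative attribute names:

```python
# Sketch: convert DynamoDB's typed attribute-value JSON, e.g.
# {"sku": {"S": "B000123"}, "qty": {"N": "4"}}, into plain Python values
# suitable for columnar output. Only the common types are handled here.

def flatten_item(item: dict) -> dict:
    """Flatten one DynamoDB item into plain name -> value pairs."""
    out = {}
    for name, typed in item.items():
        (dtype, value), = typed.items()   # each attribute is {type: value}
        if dtype == "N":                  # DynamoDB numbers arrive as strings
            out[name] = float(value) if "." in value else int(value)
        else:                             # S, BOOL, etc. kept as-is here
            out[name] = value
    return out

row = flatten_item({"sku": {"S": "B000123"}, "qty": {"N": "4"}})
print(row)
```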
Clickstream Logs (Big Data)
Problem: The business wants to analyze bot traffic and discover broken URLs. Access logs are
~50GB per day, across 7000 log files per day.
Solution: Leveraging Elastic MapReduce and Spark to produce Parquet files. Using AWS Glue,
we built a serverless Data Lake with Amazon Redshift Spectrum.
Access Logs → EMR+Spark processing → Parquet into S3 → Crawler with Glue → Query via Spectrum
Security Audit
Problem: We built the solution fast and often stored plain-text passwords or critical data.
Solution: Used AWS Secrets Manager and Amazon Macie, and updated data transformation logic
to exclude sensitive data.
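With Secrets Manager, credentials live as a JSON secret that jobs fetch at runtime instead of hard-coding. A hedged sketch (the secret name, field names, and sample values are illustrative; the real fetch is boto3's `get_secret_value`):

```python
# Sketch: credentials are stored as a JSON secret string; jobs parse it at
# runtime. The sample values below are illustrative only.
import json

def parse_secret(secret_string: str) -> dict:
    """Parse a Secrets Manager JSON secret string into credentials."""
    creds = json.loads(secret_string)
    return {"user": creds["username"], "password": creds["password"]}

# At runtime the string would come from (requires AWS credentials):
#   boto3.client("secretsmanager").get_secret_value(
#       SecretId="prod/redshift")["SecretString"]
creds = parse_secret('{"username": "etl_user", "password": "example-only"}')
print(creds["user"])
```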
Machine Learning
Problem: We are a marketplace; buyers search for products by category, and sellers do a poor
job of category labeling, which makes it difficult for buyers to find products.
Solution: Leveraging Amazon SageMaker image classification deep learning: a Convolutional
Neural Network (CNN).
Example category: Sheet Music
---------------------------------------------------------------
Outline
---------------------------------------------------------------
Prices
Example of Monthly Prices in US$
https://calculator.s3.amazonaws.com/index.html

X-Small Package:
• BI Server** (EC2) – $146
• ETL Server (EC2) – $31
• DW (Redshift) 2TB – $622
• S3 Storage (50GB) – $1.15
• 3rd Party ETL (Matillion) – $986
• Support – $123
Total: ~$2,348*

Medium Package:
• BI Server** (EC2, 16 vCPU / 64GB RAM) – $585
• ETL Server (EC2, 8GB RAM) – $73
• Data Lake (S3) 50TB – $1,200
• Big Data Processing (EMR), 3 nodes
• DW (Redshift) 10TB – $2,500
• 3rd Party ETL (Matillion) – $1,780
• Redshift Spectrum – ~$500
• Support – $460
Total: ~$7,098*

* you might get a significant discount with Yearly Reserved Instances
** does not include BI tool license cost
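As a quick sanity check, the Medium package line items (EMR is listed without a price, so it is excluded) can be summed directly:

```python
# Sum the priced Medium-package line items from the slide above.
medium = {
    "BI Server (EC2)": 585,
    "ETL Server (EC2)": 73,
    "Data Lake (S3) 50TB": 1200,
    "DW (Redshift) 10TB": 2500,
    "3rd Party ETL (Matillion)": 1780,
    "Redshift Spectrum": 500,
    "Support": 460,
}
total = sum(medium.values())
print(total)  # 7098, matching the ~$7,098 monthly figure
```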
---------------------------------------------------------------
Outline
---------------------------------------------------------------
Free Learning Resources
Coursera and edX:
• Data Warehouse for Business Intelligence Specialization
• Data Engineering on Google Cloud Platform
• Architecting with Google Cloud Platform
AWS Tutorials:
• Getting Started with Amazon Redshift
• Sizing Amazon Redshift
• Getting Started with Amazon Spectrum, Athena, Glue, EMR
• AWS Free Tier (for example, 2 months of Redshift)
AWS Trainings:
• AWS Technical Essentials
Other:
• Google Machine Learning Crash Course (Deep Learning with TensorFlow)
• Matillion Trial and Learning Materials
• Tableau Trial and Learning Materials
Contact
LinkedIn: Dmitry Anoshin
Dmitry.Anoshin@gmail.com

Building a Modern Data Platform with AWS

Editor's Notes

  • #16 The first Industrial Revolution started around 1780, and this was the beginning of the Age of Machines.
  • #17 Steam and water power was used to power engines. For the first time, we were able to produce goods using machines.
  • #18 Steam power fueled the second Industrial Revolution [which started around 1870]. Here’s the Chicago World’s Fair in 1893 – and that was the place to be if you wanted to see the latest electricity-based inventions, such as lighting systems and elevators.
  • #19 And electric power also enabled mass production. The assembly line made it possible to mass-manufacture new inventions such as the automobile and telephones. Now, up to this point, data was still kept by hand. What data integration looked like was two accountants comparing paper ledgers with each other.
  • #20 It was only during the Third Industrial Revolution – better known as the Digital Revolution – that digital data as we know it came to be. The Digital Revolution started in the 1960s, it was fueled by electricity, and it introduced computerized automation. It has brought the innovations we know and love, such as computers, smartphones, and the Internet.
  • #21 The ability to manufacture a large variety of products plus the internet gave rise to Amazon. Here's how the Amazon.com website looked when it launched in 1995.
  • #22 And here’s the website this year. This is what customers see. The changes are not just in the user experience, everything else has changed.
  • #23 This is what is happening behind the scenes.
  • #24 This is the floor plan for just one of our datacenters in Virginia. Amazon developed a very sophisticated data integration platform. We were running the largest Oracle data warehouse in the world.
  • #25 4th industrial revolution
  • #26 Alexa
  • #29 Driven by data
  • #32 The size and complexity of the data that needs to be analyzed today means the same technology and approaches that worked in the past don’t work anymore. First, the volume of data is growing exponentially with machine-generated data from internet-connected devices growing 10x faster than data from business applications. This makes it impractical for customers to purchase and install larger, more powerful hardware each time storage and compute capacity limits are reached, and also limits moving massive amounts of data to a separate analytics system prior to analyzing it. Second, the types of available data are changing from traditional operational data that are structured as tables and columns to data being generated by new sources like social media, mobile apps, websites, and devices. Customers can no longer constrain their analytics to relational data, but now need to be able to store and analyze any type of data, including non-relational data without defined relationships or schema. Third, as data is generated in real-time, customers need to go beyond analyzing historical data to analyzing data as it becomes available.
  • #33 To get the most value from their data, customers need a scalable, secure, and comprehensive data storage and analytics platform. Customers need to be able to securely store data coming from applications and devices in its native format, with high availability, durability, at low cost, and at any scale. In short, they need a data lake. Customers need to easily access and analyze data in a variety of ways using the tools and frameworks of their choice in a high performance, cost effective way without having to move large amounts of data between their storage and analytics systems. And, customers need to go beyond visualization and insights from operational reporting on historical data, to being able to perform machine learning and real-time analytics to accurately predict future outcomes.
  • #40 Company: will it be a 'winner'? Will this tool be supported and fully usable in 3–5 years? Will it be adopted by Amazon, and will there be a community of use and recommendations within Amazon (such as AWS SA)? Years in business, customers, profitability. Management: scheduling built in; intuitive views of DW processes, models, schedules; does it help someone understand DW data flows? Deployment/architecture: AWS better than local; Linux better than Windows; must be a patchable platform within Amazon guidelines.
  • #41 The biggest risk was the investment in a tool from a small player. Porting ETL processes from Matillion would be no less expensive than from PL/SQL and dblinks.