Data lakes on Amazon Web Services

My April 2017 presentation to the Wellington AWS meetup.
Rolling your own data lake takes awesome effort and expense. But if you let AWS do the undifferentiated heavy lifting, you may wonder what the fuss is about. First, a demonstration of configuring and using QuickSight, Athena, and S3 as an easy-to-configure data lake. Then architecture patterns for fitting QuickSight and Athena into a broader analytics platform.

  1. DATA LAKES ON AWS: GOOD, FAST, AND INEXPENSIVE
     WELLINGTON AWS USER GROUP
     Photo: Frank Kovalchek, http://www.flickr.com/people/72213316@N00
  2. VOLUME, VARIETY, AND VELOCITY
  3. WHAT IS A DATA LAKE?
     ▸ A network file share full of spreadsheets is a (bad) data lake
     ▸ Focused on making it easy to collect large amounts of data
     ▸ A place to store data in its natural format for future analysis
     ▸ Instead of Big Design Up Front (BDUF), it shifts governance right to remove barriers and empower users.
     ▸ Accepts in principle that slightly inefficient compute spending is worth it to make data scientists more productive.
     Photo: Frank Kovalchek, http://www.flickr.com/people/72213316@N00
  4. TWO-MINUTE DATA LAKE DEMONSTRATION
     Photo: Tim Evanson, https://www.flickr.com/photos/timevanson/
  5. DATA WAREHOUSE PROBLEMS SOLVED BY S3
     ▸ Dropbox’s distributed storage system on IEEE Software Engineering Radio (masses of people, enormous capital, long timeframe)
     ▸ Running out of space, capacity planning
     ▸ Slow hardware, unable to drink from the firehose
     ▸ Significant developer cost and delay before data can be analyzed to determine if it is valuable.
  6. DATA LAKE PROBLEMS SOLVED BY ATHENA
     ▸ Elastic scalability
     ▸ High availability
     ▸ Coupling storage to compute (HDFS)
     ▸ Hosting and admin cost of running EMR clusters
     ▸ No need to run your own data dictionary (Hive metastore) and keep it highly available across cluster outages.
     ▸ No need to run your own security (Apache Ranger)
  7. BUSINESS INTELLIGENCE PROBLEMS SOLVED BY QUICKSIGHT
     ▸ Performance at scale
     ▸ High availability
     ▸ Hosting and admin cost of running servers
  8. COMPETITORS
     ▸ Azure has similar offerings
     ▸ Power BI is good
     ▸ Azure Data Lake Analytics differences:
       ▸ Not elastic
       ▸ No optimized storage formats (ORC or Parquet)
       ▸ Uses an HDFS service, not Blob storage
  9. VISUALISATION DEMONSTRATION
     Photo: Geo Swan, https://commons.wikimedia.org/wiki/User:Geo_Swan
  10. UNEVEN COMPARISONS VS
     ▸ On-premise performance will start slower and scale poorly
     ▸ AWS Enterprise support vs ticket logging
     ▸ High availability, disaster recovery, backup costs included
     ▸ On-premise costs escalate rapidly with scale. ~$1,000,000,000 per petabyte every year
  11. TRUE COSTS OF SERVERS
     ▸ Servers aren’t being patched
     ▸ Servers aren’t natively highly available
     ▸ Server backups need to be configured, and can be misconfigured
     ▸ Server configuration slows down development
     ▸ Server performance suffers before scaling
     Photo: Micheal Filion, https://www.flickr.com/photos/mike9alive/
  12. ARCHITECTURE: DATA INGESTION
  13. OPERATIONAL ANALYTICS
  14. TRACKING PERFORMANCE
  15. RESEARCH
  16. EXTENDED DATA LAKE
  17. THE “PROJECT MANAGEMENT TRIANGLE”
     Photo: Kevin Lim, https://www.flickr.com/photos/inju/
  18. VICTIMS OF THE SYSTEM
  19. CULTURE, AUTOMATION, LEAN, MEASUREMENT
     ▸ Not tool specialists - can focus elsewhere
     ▸ Tool “automates” the hard part of the task
     ▸ Tool only does the part of the job that has value
     ▸ Transparency - everyone can see the results
     Photo: © BrokenSphere / Wikimedia Commons
  20. NO ONE WANTS A DRILL
     ▸ This presentation is about tools, but people want outcomes.
     ▸ Knowing your tools is good; making them the focus of your work is wrong.
     ▸ Providing value with a data lake is about asking the important questions, and answering those questions accurately.
     ▸ I strongly recommend putting the correct question ahead of the correct tool.
     ▸ Thinking with Data by Max Shron
     Photo: United States Marine Corps.
  21. STEVEN ENSSLEN - AUTOMATION FOR BUSINESS INTELLIGENCE
     ▸ AWS Certified Solutions Architect - Professional
     ▸ Big data and business intelligence consulting
     ▸ http://stevenensslen.com
     ▸ steven@stevenensslen.com
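
Slide 6 credits Athena with removing the need to run a cluster or your own Hive metastore: data stays in S3 and a table definition is simply laid over it. The slides do not show the demonstration itself, so the following is only a minimal sketch of what generating such a table definition might look like. The helper name `athena_external_table_ddl`, the database, table, and column names, and the bucket path are all hypothetical; Parquet is chosen because slide 8 mentions optimized storage formats.

```python
from typing import Dict, Optional


def athena_external_table_ddl(
    database: str,
    table: str,
    columns: Dict[str, str],
    location: str,
    partitions: Optional[Dict[str, str]] = None,
    stored_as: str = "PARQUET",
) -> str:
    """Build a CREATE EXTERNAL TABLE statement for files already sitting in S3.

    Hypothetical helper for illustration; it only assembles a string and
    does not talk to AWS.
    """
    if not location.startswith("s3://") or not location.endswith("/"):
        raise ValueError("location must look like 's3://bucket/prefix/'")

    # One backtick-quoted column per line, in the order given.
    cols = ",\n".join(f"  `{name}` {dtype}" for name, dtype in columns.items())
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n{cols}\n)"

    # Partition columns are declared separately from the data columns.
    if partitions:
        parts = ", ".join(f"`{name}` {dtype}" for name, dtype in partitions.items())
        ddl += f"\nPARTITIONED BY ({parts})"

    ddl += f"\nSTORED AS {stored_as}\nLOCATION '{location}'"
    return ddl


# Example with made-up names: a table over event files partitioned by day.
example_ddl = athena_external_table_ddl(
    database="lake",
    table="events",
    columns={"user_id": "string", "amount": "double"},
    location="s3://example-bucket/events/",
    partitions={"dt": "string"},
)
```

The resulting statement could be pasted into the Athena query editor. Nothing is copied or loaded: the table is only metadata pointing at the S3 prefix, which is the decoupling of storage from compute that the slide describes.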