
DoneDeal - AWS Data Analytics Platform

The DoneDeal AWS Data Analytics Platform was built using AWS products: EMR, Data Pipeline, S3, Kinesis, Redshift and Tableau. The custom-built ETL was written in PySpark.


  1. 1. DoneDeal - Data Platform. April 2016. Martin Peters (martin@donedeal.ie / @martinbpeters), DoneDeal Analytics Team Manager.
  2. 2. If you don’t understand the details of your business you are going to fail. If we can keep our competitors focused on us while we stay focused on the customer, ultimately we’ll turn out all right. - Jeff Bezos, Amazon
  3. 3. What do these companies have in common?
  4. 4. Data is … one of our biggest assets. "With the right set of information, you can make business decisions with higher levels of confidence, as you can audit and attribute the data you used for the decision-making process." - Krish Krishnan, 2014
  5. 5. Business Intelligence 101: For small companies the gap is often filled with custom ad hoc solutions with limited and rather static reporting capability.
  6. 6. What and why BI? As a company grows, the Availability, Accuracy and Accessibility requirements of data increase.
  7. 7. Some terminology: the ETL process. Extraction: extract data from homogeneous or heterogeneous data sources. Transformation: process, blend, merge and conform the data. Loading: store the data in the proper format or structure for the purposes of querying and analysis.
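
To make the three stages concrete, here is a minimal PySpark sketch of an extract-transform-load job in the style described above; the S3 paths, column names and output format are hypothetical placeholders, not the actual Apollo pipeline code.

```python
# Minimal ETL sketch in PySpark. Bucket names, columns and the output
# format are illustrative assumptions, not the real DoneDeal pipeline.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extraction: read raw events from a (hypothetical) source location.
raw = spark.read.json("s3://example-raw-bucket/events/2016/04/01/")

# Transformation: blend, conform and aggregate the data.
daily = (raw
         .withColumn("event_date", F.to_date("event_ts"))
         .groupBy("event_date", "event_type")
         .agg(F.count("*").alias("event_count")))

# Loading: store in a structure the warehouse load step can pick up.
daily.write.mode("overwrite").parquet("s3://example-curated-bucket/daily_events/")
```
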
  8. 8. April 2015 - April 2016
  9. 9. Timeline: 2014-2017 - from siloed data, manual and error-prone blending, and an undervalued BI function, through platform design and implementation (storage layer, batch layer, traditional BI, serving layer), to a speed layer and real-time analytics.
  10. 10. Business Goals & Objectives: 1. Build a future-proof data analytics platform that will scale with the company over the next 5 years. 2. Take ownership of our data; collect more data. 3. Replace the existing reporting tool. 4. Provide a holistic view of our users (buyers and sellers), ads and products. 5. Use our data in a smarter manner and provide recommendations in a timely fashion.
  11. 11. Apollo Team: Data Engineer, Data Analyst, Architect, DevOps, BI Consultants, Solution Architect. • An analytics platform that includes event streaming, data consolidation, cleansing & warehousing, data visualisation, business intelligence and data product delivery. • Apollo brings agility and flexibility to our data model; data ownership is key and allows us to blend data more conveniently.
  12. 12. Apollo Principles. Project principles: 1. The system must scale but costs grow more slowly. 2. Occam's Razor. 3. Analytics and core platforms are independent. 4. Monitoring of the platform is key. 5. Low maintenance. Data principles: 1. Accurate, Available, Accessible. 2. Ownership - business & technical. 3. Standardised across teams. 4. Integrity. 5. Identifiable - primary source and globally unique identifier.
  13. 13. Apollo Architectural Principles (www.slideshare.net/AmazonWebServices/big-data-architectural-patterns-and-best-practices-on-aws): • Decoupled "data bus" • Use the right tool/service for the job ➡ data structure, latency, throughput, access patterns • Use Lambda architecture ideas ➡ immutable (append-only), batch, [speed, serving] layers • Leverage AWS managed services ➡ scalable/elastic, available, reliable, secure, no/low admin • Big data != big cost
  14. 14. Tools/Services in Production (architecture diagram spanning data science and business users).
  15. 15. ETL Architecture: Custom-Built Pipeline (diagram: E, T and L stages, each producing a summary file).
  16. 16. ETL: Control over complex dependencies. • Allows control of ETL pipelines with complex dependencies • Easy plug-in of new data sources • Orchestration with Data Pipeline and common status or summary files • Idempotent pipeline • Historical data extracted as a simulated stream
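
As one way to picture the "summary file" orchestration and idempotency mentioned above, the sketch below checks for and writes a per-step, per-day summary object in S3 before re-running a step; the bucket, key layout and summary contents are assumptions made for illustration.

```python
# Idempotency sketch: a step only runs if its summary object is absent,
# and records one when it finishes. Bucket and key layout are hypothetical.
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "example-etl-bucket"

def summary_key(step, day):
    return "summaries/{}/{}/_SUMMARY.json".format(step, day)

def already_processed(step, day):
    try:
        s3.head_object(Bucket=BUCKET, Key=summary_key(step, day))
        return True    # summary exists: step already completed for this day
    except ClientError:
        return False   # no summary yet: safe to (re)run

def write_summary(step, day, stats):
    s3.put_object(Bucket=BUCKET,
                  Key=summary_key(step, day),
                  Body=json.dumps(stats).encode("utf-8"))

if not already_processed("extract", "2016/04/01"):
    # ... run the extraction for that day here ...
    write_summary("extract", "2016/04/01", {"records": 12345})
```
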
  17. 17. ETL: By the numbers. • Extraction: 4000 days processed, 7 different data sources, 14 domains, 13 event types. • Orchestration: 1200 processing days, 4GB/day, 3 environments, 15 data pipelines. • Data lake: 11M events streamed/day, 3 million files, 3TB of data stored over 7 buckets. • Redshift: 7B records in production, 6 schemas (core and aggregate), 86 tables in the core schema.
  18. 18. Kinesis Streams. • 1 stream with 4 shards • Data retention of 24hrs • KCL on EC2 writes data to S3 ready for Spark • Max size of 1MB per data blob • 1,000 records/sec per shard write • 5 transactions/sec read or 2MB/sec per shard • Server-side API logging from 7 application servers using the Log4J appender • Event buffering at source [in progress]
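
The producers in the deck write to the stream through a Kinesis Log4J appender on the application servers; purely as an illustration of the underlying API, a single event could be put on a stream with boto3 as below (stream name, region and payload are made up).

```python
# Illustrative single PutRecord call to a Kinesis stream; the real producers
# use a Log4J appender, so this only sketches the API involved.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")  # region is an assumption

event = {"event_type": "ad_view", "ad_id": 123, "ts": "2016-04-01T12:00:00Z"}

kinesis.put_record(
    StreamName="example-events-stream",      # the deck uses 1 stream with 4 shards
    Data=json.dumps(event).encode("utf-8"),  # each record must stay under 1MB
    PartitionKey=str(event["ad_id"]),        # distributes records across shards
)
```
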
  19. 19. S3. • Simple Storage Service provides secure, highly scalable, durable cloud storage • Native support for Spark, Hive
  20. 20. S3. • A strongly defined naming convention • YYYY/MM/DD prefix used • Avro format used for OLTP data, JSON otherwise - probably the right choice (schema evolution), although we haven't taken advantage of that yet • Allows easy retrieval of data from a particular time period • Easy to maintain and browse • Handling of summaries from the E, T & L steps
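
A small sketch of the YYYY/MM/DD convention: building a day's prefix and pointing Spark at it. The bucket and domain names are hypothetical, and the OLTP Avro data would need the spark-avro package rather than the JSON reader shown here.

```python
# Build a date-partitioned S3 prefix and read one day's events with Spark.
# Bucket/domain names are placeholders for illustration.
from datetime import date
from pyspark.sql import SparkSession

def day_prefix(bucket, domain, day):
    # e.g. s3://example-data-lake/ad_views/2016/04/01/
    return "s3://{}/{}/{:%Y/%m/%d}/".format(bucket, domain, day)

spark = SparkSession.builder.appName("read-one-day").getOrCreate()

events = spark.read.json(day_prefix("example-data-lake", "ad_views", date(2016, 4, 1)))
events.printSchema()
```
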
  21. 21. Spark on EMR. • AWS's managed Hadoop framework that can interact with data from S3, DynamoDB, etc. • Apache Spark - a fast, general-purpose engine for large-scale in-memory data processing. Runs on Hadoop/EMR and can read from S3. • PySpark + SparkSQL were the focus in Apollo. • Streaming and ML will be the focus in the months ahead.
  22. 22. Spark on EMR (lessons learned). • Spark is easy; performant Spark code is hard and time-consuming • DataFrame API used exclusively • Developing Spark applications in a local environment with a limited-size dataset differs significantly from running Spark on EMR (e.g. joins, unions etc.) • Don't pre-optimize • Naive joins are to be avoided • The Spark UI is invaluable for testing performance (both locally and on EMR) and for understanding the underlying mechanics of Spark • Some scaling of Spark on EMR; settled on memory-optimised r3.2xlarge instances (8 vCPUs, 61GB RAM).
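
One concrete instance of avoiding a naive join with the DataFrame API is to broadcast a small lookup table so the large side is not shuffled; the table and column names below are invented for illustration.

```python
# Broadcast-join sketch: ship the small lookup DataFrame to every executor
# instead of shuffling the large events DataFrame. Names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-sketch").getOrCreate()

events = spark.read.parquet("s3://example-curated-bucket/events/")      # large
counties = spark.read.parquet("s3://example-curated-bucket/counties/")  # small lookup

# The broadcast hint turns a shuffle join into a map-side join; the Spark UI
# shows which join strategy was actually chosen.
enriched = events.join(broadcast(counties), on="county_id", how="left")
enriched.explain()
```
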
  23. 23. Data Pipeline + Simple Notification Service. • Data Pipeline is a service to reliably process and move data between AWS services (e.g. S3, EMR, DynamoDB) • Pipelines run on a schedule and alarms are issued with Simple Notification Service (SNS) • EMR/Spark used for compute and EC2 used for loading data into Redshift • Debugging can be a challenge
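
Data Pipeline ties alarms to SNS through an SnsAlarm object referenced by an activity's onFail field in the pipeline definition; the fragment below sketches that shape as the JSON you would upload, with a placeholder topic ARN, activity, step command and cluster reference.

```python
# Sketch of a Data Pipeline definition fragment wiring an EMR activity's
# failure to an SNS topic. All ARNs, ids and the step command are placeholders.
import json

definition = {
    "objects": [
        {
            "id": "FailureAlarm",
            "type": "SnsAlarm",
            "topicArn": "arn:aws:sns:eu-west-1:123456789012:example-alerts",
            "subject": "ETL pipeline step failed",
            "message": "A pipeline step failed; check the Data Pipeline console.",
        },
        {
            "id": "SparkStep",
            "type": "EmrActivity",
            "runsOn": {"ref": "EmrClusterForSpark"},
            "step": "command-runner.jar,spark-submit,s3://example-code/etl_job.py",
            "onFail": {"ref": "FailureAlarm"},  # SNS notification on failure
        },
    ]
}

print(json.dumps(definition, indent=2))
```
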
  24. 24. Redshift. • Dense Compute or Dense Storage? A single ds2.xlarge instance - the right balance between storage/memory/compute and cost/hr • Strict ETL: no transformation is carried out in the DW, an append-only strategy - leverage the power and scalability of EMR and the insert speed of Redshift - no updates in the DW, drop and recreate • Tuning is a time-consuming task and requires rigorous testing • Define sort, distribution and interleaved keys as early as possible • Reserved nodes will be used in future. Schemas per environment - Core: cmtest (Test), cmdev (Dev), cmprod (Prod); Aggregate: agtest, agdev, agprod; read permissions granted across them. Kimball star schema: conformed dimensions across all data sources.
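
A hedged sketch of the append-only, drop-and-recreate load into Redshift described above, using psycopg2 and a COPY from S3; the cluster endpoint, table, sort/dist keys, bucket and IAM role are placeholders rather than the real Apollo schema.

```python
# Drop-and-recreate load sketch for Redshift: no in-warehouse transformation,
# just rebuild the table and bulk-COPY it from S3. All names are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="example-password")

with conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS cmprod.fact_ad_views;")
    cur.execute("""
        CREATE TABLE cmprod.fact_ad_views (
            ad_id      BIGINT,
            event_date DATE,
            views      BIGINT
        )
        DISTKEY (ad_id)
        SORTKEY (event_date);
    """)
    cur.execute("""
        COPY cmprod.fact_ad_views
        FROM 's3://example-curated-bucket/fact_ad_views/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
        FORMAT AS AVRO 'auto';
    """)
```
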
  25. 25. Tableau on EC2. • Tableau Server runs on EC2 (c3.2xlarge) inside the AWS environment • Tableau Desktop used to develop dashboards that are published to the server • Connection to the Redshift data warehouse via the JDBC/ODBC connector • Maps support is poor for countries outside the US. http://www.slideshare.net/AmazonWebServices/analytics-on-the-cloud-with-tableau-on-aws
  26. 26. Up next? • Increase the number of data streams / remove dependence on OLTP • Traditional BI/reporting - more dashboards • [In progress] Data products with Spark ML/Amazon ML, DynamoDB, Lambda & API Gateway • Trials of Kinesis Firehose, Kinesis Analytics, QuickSight • Improved code deployment with CodePipeline and CodeCommit
  27. 27. DoneDeal Image Service Upgrade. • Image storage and transformation moved to AWS • Over 4.5M images migrated to S3 • ECS + ELB used for image resizing • An autoscaling group enables adding new image sizes • We now run Docker in production thanks to ECS • Investigating uses for AWS Lambda in image processing. For more info: @davidconde
  28. 28. DoneDeal Dynamic Test Environments. • QA can now run any feature branch of DoneDeal directly from our CI server • Uses Jenkins / Docker (Machine + Compose) / EC2 & Route 53 • Enables rapid testing without server contention • Also used by the mobile team to develop against and test new APIs. For more info: @davidconde
  29. 29. Q&A Session. Nigel Creighton, CTO at DNM; Martin Peters, BI Manager at DoneDeal.
