
Designing a Horizontally Scalable Event-Driven Big Data Architecture with Apache Spark

Traditional data architectures are not enough to handle the huge amounts of data generated by millions of users. In addition, the diversity of data sources is increasing every day: distributed file systems, relational, column-oriented, document-oriented, or graph databases.

Letgo has been growing quickly over the last few years. Because of this, we needed to improve the scalability of our data platform and endow it with further capabilities, like dynamic infrastructure elasticity, real-time processing, and real-time complex event processing. In this talk, we dive deeper into our journey: we started from a traditional data architecture with ETL and Redshift, and arrived at the event-oriented, horizontally scalable data architecture we run today.

We will explain in detail everything from event ingestion with Kafka / Kafka Connect to streaming and batch processing with Spark. On top of that, we will discuss how we use Spark Thrift Server / Hive Metastore as glue to exploit all our data sources (HDFS, S3, Cassandra, Redshift, MariaDB …) in a unified way from any point of our ecosystem, using technologies like Jupyter, Zeppelin, and Superset. We will also describe how to build ETLs with pure Spark SQL, orchestrated with Airflow.

Along the way, we will highlight the challenges we found and how we solved them, and share many useful tips for those who want to start this journey in their own companies.



  1. Ricardo Fanjul, letgo: Designing a Horizontally Scalable Event-Driven Big Data Architecture with Apache Spark
  2. 2018: DATA ODYSSEY
  3. Ricardo Fanjul, Data Engineer
  4. Founded in 2015. 100MM+ downloads. 400MM+ listings.
  5. LETGO DATA PLATFORM IN NUMBERS: 500GB of data daily; 1 billion events processed daily; peaks of 50K events per second; 600+ event types; 200TB of storage (S3); < 1 sec NRT processing time.
  6. THE DAWN OF LETGO
  7. CLASSICAL BI PLATFORM (THE DAWN OF LETGO)
  8. CLASSICAL BI PLATFORM (THE DAWN OF LETGO)
  9. MOVING TO μ-SERVICES AND EVENTS (THE DAWN OF LETGO)
  10. MOVING TO μ-SERVICES AND EVENTS (THE DAWN OF LETGO): Domain Events, Tracking Events
  11. Platform overview: INGEST (Data Ingestion, Storage), PROCESS (Stream, Batch), DISCOVER (Query, Data exploitation, Orchestration)
  12. INGEST
  13. Platform overview: now at Data Ingestion
  14. OUR GOAL (DATA INGESTION)
  15. THE DISCOVERY (DATA INGESTION)
  16. KAFKA CONNECT (DATA INGESTION): Amazon Aurora
  17. THE JOURNEY BEGINS
  18. Platform overview: now at Storage
  19. BUILDING THE DATA LAKE (STORAGE)
  20. BUILDING THE DATA LAKE (STORAGE): We want to store all events coming from Kafka in S3.
  21. BUILDING THE DATA LAKE (STORAGE)
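
The deck shows this as a diagram only. As a minimal sketch of the idea (broker, topic, and bucket names are made up), a Spark Structured Streaming job that lands every Kafka event in S3 could look like this:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("events-to-s3").getOrCreate()

    // Read the raw event stream from Kafka.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092") // hypothetical broker
      .option("subscribe", "events")                   // hypothetical topic
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // Append every event to S3; the checkpoint lets the job resume from the
    // last committed Kafka offsets after a restart.
    events.writeStream
      .format("json")
      .option("checkpointLocation", "s3a://data-lake/checkpoints/events")
      .option("path", "s3a://data-lake/events")
      .start()
      .awaitTermination()
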
  22. SOMETIMES… SHIT HAPPENS (STORAGE)
  23. DUPLICATED EVENTS (STORAGE)
  24. DUPLICATED EVENTS (STORAGE)
  25. (VERY) LATE EVENTS (STORAGE)
  26. (VERY) LATE EVENTS (STORAGE), the "Dirty Buckets" flow: 1) Read a batch of events from Kafka. 2) Write each event to Cassandra. 3) Write "dirty hours" to a compacted topic, keyed by (event_type, hour). 4) Read the dirty-hours topic. 5) Read all events belonging to dirty hours. 6) Store them in S3. (A sketch of steps 1 and 3 follows.)
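
A minimal sketch of steps 1 and 3, under assumptions (broker, topic, and JSON field names are illustrative; `spark` is the SparkSession from the previous sketch): a bounded batch read from Kafka derives the (event_type, hour) buckets it touched and marks them dirty in a compacted topic, so a later job only re-exports those hours from Cassandra to S3.

    import org.apache.spark.sql.functions._

    // 1) Read a batch of events from Kafka (offsets bounded per run).
    val batch = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // ... step 2 (writing each event to Cassandra) omitted ...

    // 3) Derive the (event_type, hour) buckets this batch touched.
    val dirtyHours = batch.select(
        get_json_object(col("json"), "$.data.type").as("event_type"),
        date_trunc("hour",
          (get_json_object(col("json"), "$.meta.created_at") / 1000)
            .cast("timestamp")).as("hour"))
      .distinct()

    // Write one marker per bucket to a log-compacted topic keyed by
    // (event_type, hour); compaction keeps only the latest marker per key,
    // so the set of dirty hours stays small.
    dirtyHours
      .selectExpr("concat(event_type, '|', hour) AS key", "'dirty' AS value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("topic", "dirty-hours")
      .save()
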
  27. S3 PROBLEMS (STORAGE)
  28. S3 PROBLEMS (STORAGE). Some S3 big data problems: 1) Eventual consistency. 2) Very slow renames.
  29. S3 PROBLEMS: EVENTUAL CONSISTENCY (STORAGE)
  30. S3 PROBLEMS: EVENTUAL CONSISTENCY (STORAGE). S3Guard: S3AFileSystem (the Hadoop FileSystem for Amazon S3) sends filesystem-metadata writes and reads through a DynamoDB client, while object writes and listings go through the S3 client, so listings stay consistent.
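
One way to switch S3Guard on from Spark (the property keys are the standard S3Guard ones; the table name and region below are placeholders):

    import org.apache.spark.sql.SparkSession

    // Route S3A filesystem metadata through DynamoDB so listings are
    // consistent even while S3 itself is eventually consistent.
    val spark = SparkSession.builder()
      .config("spark.hadoop.fs.s3a.metadatastore.impl",
              "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
      .config("spark.hadoop.fs.s3a.s3guard.ddb.table", "s3guard-metadata") // placeholder
      .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "us-east-1")       // placeholder
      .getOrCreate()
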
  31. S3 PROBLEMS: SLOW RENAMES (STORAGE). Job freeze?
  32. S3 PROBLEMS: SLOW RENAMES (STORAGE). New Hadoop 3.1 S3A committers: Directory, Partitioned, Magic.
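
A hedged sketch of how these committers are typically wired into Spark, following the Hadoop 3.1 cloud-integration documentation (class names from the spark-hadoop-cloud module; shown here with the directory committer):

    import org.apache.spark.sql.SparkSession

    // Commit task output via S3 multipart uploads instead of renaming a
    // _temporary directory, which is slow and unsafe on S3.
    val spark = SparkSession.builder()
      .config("spark.hadoop.fs.s3a.committer.name", "directory") // or partitioned / magic
      .config("spark.sql.sources.commitProtocolClass",
              "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
              "org.apache.spark.internal.io.cloud.BinaryParquetOutputCommitter")
      .getOrCreate()
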
  33. PROCESS
  34. Platform overview: now at Stream
  35. REAL-TIME USER SEGMENTATION (STREAM). Diagram: Journal → Stream → "User bucket changed" (steps 1-2)
  36. REAL-TIME USER SEGMENTATION (STREAM). Diagram: Journal → Stream → "User bucket changed" (steps 1-2)
  37. REAL-TIME PATTERN DETECTION (STREAM). Chat messages: "Is it still available?" "What condition is it in?" "Could we meet at……?" "Is the price negotiable?" "I offer you….$"
  38. REAL-TIME PATTERN DETECTION (STREAM). Structured data: { "type": "meeting_proposal", "properties": { "location_name": "Letgo HQ", "geo": { "lat": "41.390205", "lon": "2.154007" }, "date": "1511193820350", "meeting_id": "23213213213" } }
  39. REAL-TIME PATTERN DETECTION (STREAM): meeting proposed + meeting accepted = emit accepted-meeting event; meeting proposed + nothing in X time = "You have a proposal to meet". (A sketch follows.)
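
The talk does not show the implementation; one way to express this pattern in Spark Structured Streaming (all names, and the 1-hour stand-in for "X time", are assumptions) is keyed state with a processing-time timeout:

    import java.sql.Timestamp
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
    import spark.implicits._ // assumes a SparkSession named `spark` is in scope

    case class MeetingEvent(meetingId: String, eventType: String, ts: Timestamp)
    case class Alert(meetingId: String, kind: String)

    // Per meeting_id: a proposal followed by an acceptance emits an
    // accepted-meeting alert; a proposal with no acceptance before the
    // timeout emits a reminder.
    def track(id: String, events: Iterator[MeetingEvent],
              state: GroupState[String]): Iterator[Alert] = {
      if (state.hasTimedOut) {
        state.remove()
        Iterator(Alert(id, "meeting_proposal_reminder"))
      } else {
        val alerts = events.toVector.flatMap { e =>
          e.eventType match {
            case "meeting_proposal" =>
              state.update("proposed")
              state.setTimeoutDuration("1 hour") // the "X time" from the slide
              None
            case "meeting_accepted" if state.exists =>
              state.remove()
              Some(Alert(id, "accepted_meeting"))
            case _ => None
          }
        }
        alerts.iterator
      }
    }

    // meetingEvents: a Dataset[MeetingEvent] parsed from the Kafka stream.
    val alerts = meetingEvents
      .groupByKey(_.meetingId)
      .flatMapGroupsWithState(OutputMode.Append,
        GroupStateTimeout.ProcessingTimeTimeout)(track)
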
  40. Platform overview: now at Batch
  41. GEO DATA ENRICHMENT (BATCH)
  42. GEO DATA ENRICHMENT (BATCH). { "data": { "id": "105dg3272-8e5f-426f-bca0-704e98552961", "type": "some_event", "attributes": { "latitude": 42.3677203, "longitude": -83.1186093 } }, "meta": { "created_at": 1522886400036 } } Technically correct, but not very actionable.
  43. GEO DATA ENRICHMENT (BATCH). What we know: (42.3677203, -83.1186093). What we can derive: City: Detroit; Postal code: 48206; State: Michigan; DMA: Detroit; Country: US.
  44. GEO DATA ENRICHMENT (BATCH). How we do it: populate JTS indices from WKT polygon data and expose a custom Spark SQL UDF: SELECT geodata.dma_name, geodata.dma_number AS dma_number, geodata.city AS city, geodata.state AS state, geodata.zip_code AS zip_code FROM (SELECT geodata(longitude, latitude) AS geodata FROM ….)
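
A minimal sketch of that approach (the polygon source, field set, and helper names are assumptions; only the UDF name `geodata` comes from the query above): load WKT polygons with their metadata into a JTS STRtree, then register the point-in-polygon lookup as a Spark SQL UDF.

    import org.locationtech.jts.geom.{Coordinate, Geometry, GeometryFactory}
    import org.locationtech.jts.index.strtree.STRtree
    import org.locationtech.jts.io.WKTReader
    import scala.collection.JavaConverters._

    case class GeoInfo(city: String, state: String, zip_code: String,
                       dma_name: String, dma_number: String)

    // Build an in-memory spatial index from WKT polygons and their metadata.
    // `polygons: Seq[(String, GeoInfo)]` would come from the WKT dataset;
    // the index is built on the driver and shipped with the UDF closure.
    val reader = new WKTReader()
    val index = new STRtree()
    polygons.foreach { case (wkt, info) =>
      val geom = reader.read(wkt)
      index.insert(geom.getEnvelopeInternal, (geom, info))
    }
    index.build()

    val factory = new GeometryFactory()
    def lookup(lon: Double, lat: Double): GeoInfo = {
      val point = factory.createPoint(new Coordinate(lon, lat))
      // The R-tree narrows candidates by bounding box; contains() confirms.
      index.query(point.getEnvelopeInternal).asScala
        .collectFirst { case (g: Geometry, info: GeoInfo) if g.contains(point) => info }
        .orNull
    }

    // Callable from SQL as geodata(longitude, latitude), returning a struct
    // with city, state, zip_code, dma_name, dma_number fields.
    spark.udf.register("geodata", (lon: Double, lat: Double) => lookup(lon, lat))
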
  45. DISCOVER
  46. Platform overview: now at Query
  47. QUERYING DATA (QUERY)
  48. QUERYING DATA (QUERY)
  49. QUERYING DATA (QUERY): Amazon Aurora
  50. QUERYING DATA (QUERY): METASTORE, SPECTRUM
  51. QUERYING DATA (QUERY)
  52. QUERYING DATA (QUERY): METASTORE, Thrift Server, Amazon Aurora
  53. QUERYING DATA (QUERY). Table definitions for each source:

    CREATE TABLE IF NOT EXISTS database_name.table_name(
      some_column STRING,
      ...
      dt DATE
    )
    USING json
    PARTITIONED BY (`dt`)

    CREATE TEMPORARY VIEW table_name
    USING org.apache.spark.sql.cassandra
    OPTIONS (table "table_name", keyspace "keyspace_name")

    CREATE EXTERNAL TABLE IF NOT EXISTS database_name.table_name(
      some_column STRING...,
      dt DATE
    )
    PARTITIONED BY (`dt`)
    USING PARQUET
    LOCATION 's3a://bucket-name/database_name/table_name'

    CREATE TABLE IF NOT EXISTS database_name.table_name
    USING com.databricks.spark.redshift
    OPTIONS (
      dbtable 'schema.redshift_table_name',
      tempdir 's3a://redshift-temp/',
      url 'jdbc:redshift://xxxx.redshift.amazonaws.com:5439/letgo?user=xxx&password=xxx',
      forward_spark_s3_credentials 'true')
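
Once such definitions are registered against the shared Hive Metastore, a Spark client (Thrift Server, Jupyter, Zeppelin…) can join across sources in a single statement. An illustrative query (both table names are made up):

    // Join a Cassandra-backed view with a Parquet table on S3 in one query.
    val report = spark.sql("""
      SELECT e.user_id, c.user_bucket, count(*) AS events
      FROM database_name.events_parquet e   -- Parquet on S3
      JOIN user_buckets_cassandra c         -- Cassandra temporary view
        ON c.user_id = e.user_id
      WHERE e.dt = date '2018-10-01'
      GROUP BY e.user_id, c.user_bucket
    """)
    report.show()
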
  54. QUERYING DATA (QUERY): switching from CREATE TABLE … STORED AS … (Hive SerDe readers) to CREATE TABLE … USING [parquet, json, csv…] (Spark's native data sources) gave 70% higher performance!
  55. QUERYING DATA: BATCHES WITH SQL (QUERY): 1) Creating the table, 2) Inserting data
  56. QUERYING DATA: BATCHES WITH SQL (QUERY). 1) Creating the table: CREATE EXTERNAL TABLE IF NOT EXISTS database.some_name(user_id STRING, column_b STRING, ...) USING PARQUET PARTITIONED BY (`dt` STRING) LOCATION 's3a://example/some_table'
  57. QUERYING DATA: BATCHES WITH SQL (QUERY). 2) Inserting data: INSERT OVERWRITE TABLE database.some_name PARTITION(dt) SELECT user_id, column_b, dt FROM other_table ...
  58. QUERYING DATA: BATCHES WITH SQL (QUERY). Problem?
  59. QUERYING DATA: BATCHES WITH SQL (QUERY). 200 files per partition, because the default value of spark.sql.shuffle.partitions is 200.
  60. QUERYING DATA: BATCHES WITH SQL (QUERY). INSERT OVERWRITE TABLE database.some_name PARTITION(dt) SELECT user_id, column_b, dt FROM other_table ... ?
  61. QUERYING DATA: BATCHES WITH SQL (QUERY). Options: DISTRIBUTE BY (dt): only one file per partition, not sorted. CLUSTER BY (dt, user_id, column_b): multiple files. DISTRIBUTE BY (dt) SORT BY (user_id, column_b): only one file per partition, sorted by user_id, column_b; good for joins on these properties.
  62. QUERYING DATA: BATCHES WITH SQL (QUERY). INSERT OVERWRITE TABLE database.some_name PARTITION(dt) SELECT user_id, column_b, dt FROM other_table ... DISTRIBUTE BY (dt) SORT BY (user_id)
  63. QUERYING DATA: BATCHES WITH SQL (QUERY)
  64. Platform overview: now at Data exploitation
  65. OUR ANALYTICAL STACK (DATA EXPLOITATION)
  66. OUR ANALYTICAL STACK (DATA EXPLOITATION): Amazon Aurora, METASTORE, Thrift Server
  67. DATA SCIENTISTS TEAM? (DATA EXPLOITATION)
  68. DATA SCIENTISTS AS I SEE THEM (DATA EXPLOITATION)
  69. DATA SCIENTISTS' SINS (DATA EXPLOITATION)
  70. DATA SCIENTISTS' SINS (DATA EXPLOITATION): Too many small files!
  71. DATA SCIENTISTS' SINS (DATA EXPLOITATION): Huge query!
  72. DATA SCIENTISTS' SINS (DATA EXPLOITATION): Too much shuffle!
  73. Platform overview: now at Orchestration
  74. AIRFLOW (ORCHESTRATION)
  75. AIRFLOW (ORCHESTRATION)
  76. AIRFLOW (ORCHESTRATION): I'm happy!!
  77. MOVING TO A STATELESS CLUSTER
  78. I'M SORRY RICARDO. I'M AFRAID I CAN'T DO THAT.
  79. PLATFORM LIMITATIONS (MOVING TO A STATELESS CLUSTER)
  80. PLANNING THE SOLUTION (MOVING TO A STATELESS CLUSTER)
  81. PLANNING THE SOLUTION (MOVING TO A STATELESS CLUSTER)
  82. PLANNING THE SOLUTION (MOVING TO A STATELESS CLUSTER)
  83. A LONG AND WINDING PROCESS (MOVING TO A STATELESS CLUSTER)
  84. A LONG AND WINDING PROCESS (MOVING TO A STATELESS CLUSTER)
  85. NEW CAPABILITIES OF THE PLATFORM (MOVING TO A STATELESS CLUSTER)
  86. NEW CAPABILITIES OF THE PLATFORM (MOVING TO A STATELESS CLUSTER)
  87. JUPITER AND BEYOND THE INFINITE
  88. SOMETHING WONDERFUL WILL HAPPEN (JUPITER AND BEYOND THE INFINITE)
  89. REVEAL (JUPITER AND BEYOND THE INFINITE): https://youtu.be/CInMDMuSFwc
  90. "The only way to discover the limits of the possible is to go beyond them into the impossible." – Arthur C. Clarke
  91. THE FUTURE…? DO YOU WANT TO JOIN US? https://boards.greenhouse.io/letgo
