Data Engineering Challenges - DSE Day at Bandung Institute of Technology

Published in: Technology

  1. Data Engineering Challenges, DSE Days - 10 Sept 2015
  2. Structure: 1. Data Engineering 2. Data Pipeline 3. Data Engineering Challenges 4. Closing
  3. Section 1: Data Engineering
  4. All those buzzwords... ● Data explosion, big data ● Data scientist ● IoT ● Data-driven company
  5. Who is a Data Engineer? “The role of data engineer is now used throughout industry to describe the highly specialized software engineers who create and maintain these robust big data pipelines.” - Insight Data Engineering. Basically, we are software engineers.
  6. Section 2: Data Pipeline
  7. Data Pipeline: INGESTION (take it), DATA MANAGEMENT (manage them), PROCESSING (process it), STORAGE (store it), RETRIEVAL (use it)
  8. Lambda Architecture: INGESTION (take it), DATA MANAGEMENT (manage them), BATCH PROCESSING (process it), STREAM PROCESSING (process it NOW), STORAGE (store it), RETRIEVAL (use it)
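
A minimal sketch of the batch/stream split on the Lambda Architecture slide, assuming page-view events are dicts with a `country` field. The function names (`batch_view`, `speed_view`, `query`) are hypothetical and not from the slides; a real deployment would run these layers on separate systems rather than in one process.

```python
from collections import Counter

def batch_view(master_events):
    """Batch layer: recompute counts from the full event history (slow, complete)."""
    return Counter(event["country"] for event in master_events)

def speed_view(recent_events):
    """Stream layer: count only events that arrived after the last batch run."""
    return Counter(event["country"] for event in recent_events)

def query(country, batch, speed):
    """Retrieval: merge both views so the answer is complete and up to date."""
    return batch[country] + speed[country]
```

The batch view can be thrown away and rebuilt at any time; the speed view only has to cover the gap since the last batch run.
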
  9. Big Data Pipeline
  10. Section 3: Data Engineering Challenges
  11. Challenges - Ingestion: throughput, availability, scalability
  12. Challenges - Ingestion. Sample Problem: Facebook page views, ~1 trillion/month, i.e. about 385,802 log writes or inserts per second (assuming a 30-day month). Sample Solution: Kafka, 2 million writes/s (on 3 cheap machines) ● Simple (log) → throughput, O(1) ● Partitioning → scalability ● Replication → availability
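
The arithmetic and the "simple log plus partitioning" idea on this slide can be sketched as follows. This is an illustration only, not Kafka's implementation: `PartitionedLog` is a hypothetical in-memory class, CRC32 stands in for whatever hash a real partitioner uses, and replication is not modeled.

```python
import zlib

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month

def events_per_second(events_per_month):
    """Average write rate implied by a monthly event count."""
    return events_per_month / SECONDS_PER_MONTH

class PartitionedLog:
    """Toy append-only log split into partitions by key hash."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Same key always maps to the same partition, so per-key order is kept.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(value)  # append at the tail: O(1)
        return index, len(self.partitions[index]) - 1  # (partition, offset)

print(int(events_per_second(1_000_000_000_000)))  # 385802
```

Adding partitions spreads writes over more machines, which is where the scalability on the slide comes from.
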
  13. Challenges - Ingestion. Challenge 1 - Wiring to the Main App ● May require some changes in the application. Challenge 2 - Failure Isolation ● A failure while logging should have minimal impact on the application.
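
One possible way to get the failure isolation described above is to hand events to a bounded in-memory queue and let a background thread ship them. The sketch below assumes that approach; `IsolatedLogger` and its `sink` callable are hypothetical names, and dropping events when the queue is full is a design choice the slides do not prescribe.

```python
import queue
import threading

class IsolatedLogger:
    """Non-blocking logger: the application never waits on, or fails with, the sink."""

    def __init__(self, sink, max_pending=1000):
        self._sink = sink  # callable that ships one event (may raise)
        self._queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0   # events rejected because the queue was full
        self.failed = 0    # events the sink raised on
        threading.Thread(target=self._drain, daemon=True).start()

    def log(self, event):
        """Called from the application; returns immediately either way."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        while True:
            event = self._queue.get()
            try:
                self._sink(event)
            except Exception:
                self.failed += 1  # swallow: sink errors must not reach the app
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every queued event has been handled (useful in tests)."""
        self._queue.join()
```

The application only ever calls `log`, so wiring it into the main app is a one-line change per event, and a slow or broken sink costs dropped events rather than failed requests.
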
  14. Challenges - Processing: integrity, dependency, performance
  15. Challenges - Processing. Sample Problem: How many page views came from Indonesia in Aug 2015? ~10 PB of data at 10 KB/datum, given ~1 trillion page views per month. Sample Solution: ● Spark/Hadoop for computing ● HDFS for storage, with Avro as the file format ● Oozie for workflow management
  16. Challenges - Processing. Challenge 1 - Learning Curve ● A new way of thinking about data processing: MapReduce ● New technology and operational concerns. Challenge 2 - Putting It All Together ● Incompatible release versions ● Minimal documentation
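
To make the MapReduce way of thinking concrete, here is a single-process sketch of the sample problem (page views per country for Aug 2015). It assumes records are dicts with `month` and `country` fields; the function names are hypothetical, and a real job would run the map and reduce phases in parallel on Spark or Hadoop rather than in plain Python.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (country, 1) for each page view in the target month."""
    if record["month"] == "2015-08":
        yield record["country"], 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse each group into a single count."""
    return key, sum(values)

def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

The shift in thinking is that each phase only sees one record or one key at a time, which is what lets the framework spread the work across machines.
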
  17. Challenges - Storage: efficiency, performance
  18. Challenges - Storage. Sample Problems: 1. We want the number of daily page views from Indonesia for the last 7 days. 2. We want to retrieve a user’s latest transaction to better personalize search results. Sample Solutions: 1. You might need a columnar store for OLAP queries. 2. You might need a key-value store, since the data will be retrieved by user ID.
  19. Challenges - Storage. Challenge 1 - Choosing the Right Storage ● There are many kinds of databases nowadays. Pick wisely, so that the choice best supports your use cases. Challenge 2 - Developing the Right Model ● Each database models data differently, and a relational model might not be appropriate. We need to understand how the database works.
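
The two sample problems lead to two different data models. The sketch below mimics them with plain Python structures, purely to show the access patterns; the field names (`date`, `country`, `ts`) and both functions are hypothetical, and no real database is involved.

```python
from collections import defaultdict

def daily_views(columns, country):
    """Column-oriented layout: an aggregate scans only the columns it needs."""
    counts = defaultdict(int)
    for date, row_country in zip(columns["date"], columns["country"]):
        if row_country == country:
            counts[date] += 1
    return dict(counts)

def put_transaction(store, user_id, transaction):
    """Key-value layout: keep one value per user ID, the latest transaction."""
    current = store.get(user_id)
    if current is None or transaction["ts"] >= current["ts"]:
        store[user_id] = transaction

# Columnar data: one list per column, rows aligned by position.
columns = {
    "date": ["2015-08-01", "2015-08-01", "2015-08-02"],
    "country": ["ID", "SG", "ID"],
    "url": ["/a", "/b", "/c"],  # never touched by daily_views
}

# Key-value data: a single lookup by user ID answers the second problem.
latest = {}
put_transaction(latest, "user-1", {"ts": 10, "item": "book"})
put_transaction(latest, "user-1", {"ts": 20, "item": "lamp"})
```

Neither layout serves the other query well, which is the point of choosing storage per use case.
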
  20. Challenges - Retrieval: ease of use, reusability, adaptiveness
  21. Challenges - Retrieval. Sample Problem: ● We want to visualize the number of daily page views from Indonesia for the last 7 days ● Other problems, such as ad hoc queries and reporting. Sample Solution: ● Create a backend service to run queries and an application to visualize the query results
  22. Challenges - Retrieval. Challenge 1 - Ease of Use, Reusability ● Ease of use is very important because retrieval is a user-facing product. Data products have to be reusable and discoverable across data users. Challenge 2 - Adaptiveness ● Because there are many kinds of databases now, the query service needs to be extensible and adaptive so that it can use data from various sources.
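
One common way to keep a query service extensible is to hide each database behind a shared interface and register backends by name. The sketch below assumes that design; `DataSource`, `InMemorySource`, and `QueryService` are hypothetical names, and the in-memory backend stands in for a real database adapter.

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Interface every storage backend adapter has to implement."""

    @abstractmethod
    def daily_page_views(self, country, days):
        """Return up to `days` most recent (date, views) pairs, oldest first."""

class InMemorySource(DataSource):
    """Stand-in backend; rows are (iso_date, country, views) tuples."""

    def __init__(self, rows):
        self.rows = rows

    def daily_page_views(self, country, days):
        matching = sorted((date, views) for date, c, views in self.rows if c == country)
        return matching[-days:]

class QueryService:
    """Single entry point for visualization and reporting tools."""

    def __init__(self):
        self._sources = {}

    def register(self, name, source):
        self._sources[name] = source

    def daily_page_views(self, source_name, country, days=7):
        return self._sources[source_name].daily_page_views(country, days)
```

Supporting a new database then means writing one more `DataSource` subclass, with no change to the visualization side.
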
  23. Challenges - Data Management
  24. Challenges - Data Management. Challenge 1 - Centralized Metadata ● Data lives in various places, with various schemas (sometimes schemaless), and has to be managed centrally. Challenge 2 - Security, Access Control ● Most of these tools are newly developed, and security is usually the last thing we consider.
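
A minimal sketch of what a centralized metadata catalog with access control could look like, assuming one entry per dataset. All names here (`DatasetEntry`, `MetadataCatalog`, the example locations and users) are hypothetical; `schema=None` marks a schemaless dataset.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DatasetEntry:
    name: str
    location: str            # where the data physically lives
    schema: Optional[dict]   # column name -> type, or None if schemaless
    readers: set = field(default_factory=set)  # users allowed to read it

class MetadataCatalog:
    """One place to look up where a dataset is, its schema, and who may read it."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def lookup(self, name, user):
        entry = self._entries[name]  # KeyError if the dataset is unknown
        if user not in entry.readers:
            raise PermissionError(f"{user} may not read {name}")
        return entry

catalog = MetadataCatalog()
catalog.register(DatasetEntry(
    name="page_views",
    location="hdfs://example/page_views",  # hypothetical path
    schema={"date": "string", "country": "string"},
    readers={"alice"},
))
catalog.register(DatasetEntry(
    name="raw_events",
    location="kv://example/raw_events",    # hypothetical path
    schema=None,                           # schemaless
    readers={"alice", "bob"},
))
```

Putting the access check in the catalog means security is enforced in one place instead of separately in each storage system.
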
  25. Section 4: Closing
  26. Takeaway Points ● Think critically ○ Be wise and don’t get carried away: do not use something just because it is cool; make sure you are using what you need. ● Stay curious ○ New technology arrives every day, and some of it might save your day.
  27. What is it like to be a Data Engineer? ● Exhilarating ○ You are in a critical position: you handle big volumes of data, act as the nerve of the company, and have to make sure the pipeline is robust. ● Challenging ○ You have to be a DBA, data architect, big data programmer, software engineer, and data analyst at the same time! ● Fun ○ You always need to learn new technology and new ways to solve things. ● High Demand ○ Data engineer is one of the most in-demand job roles at today’s leading companies.
  28. Q&A
  29. References ● http://insightdataengineering.com/blog/The-Data-Engineering-Ecosystem-An-Interactive-Map.html ● http://insightdataengineering.com/Insight_Data_Engineering_White_Paper.pdf
