
Data Democratization at Nubank

Nubank is the leading fintech in Latin America. Using bleeding-edge technology, design, and data, the company aims to fight complexity and empower people to take control of their finances. We are disrupting an outdated and bureaucratic system by building a simple, safe and 100% digital environment.

In order to succeed, we need to constantly make better decisions at the speed of insight, and that’s what we aim for when building Nubank’s Data Platform. In this talk we want to explore and share the guiding principles behind an automated, scalable, declarative and self-service platform with more than 200 contributors, mostly non-technical, who build 8,000 distinct datasets, ingesting data from 800 databases and leveraging Apache Spark’s expressiveness and scalability.

The topics we want to explore are:
– Making data-ingestion a no-brainer when creating new services
– Reducing the cycle time to deploy new Datasets and Machine Learning models to production
– Closing the loop and leveraging knowledge processed in the analytical environment to make decisions in production
– Providing the perfect level of abstraction to users

You will get from this talk:
– Our love for ‘The Log’ and how we use it to decouple databases from their schemas and distribute the work of keeping schemas up to date across the entire team.
– How we made data ingestion so simple using Kafka Streams that teams stopped using databases for analytical data (a minimal ingestion sketch follows this list).
– The huge benefits of relying on the DataFrame API to create datasets, which made it possible to have end-to-end tests verifying that all 8,000 datasets work without even running a Spark job, and much more.
– The importance of creating the right amount of abstractions and restrictions to have the power to optimize.
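
The ingestion sketch referenced above, in the spirit of “ship the service’s log, don’t query its database”: a small Kafka Streams topology that forwards a service’s transaction-log topic into an analytical ingestion topic. The topic names, serdes and the tombstone filter are illustrative assumptions, not Nubank’s actual pipeline.

    import java.util.Properties
    import org.apache.kafka.common.serialization.Serdes
    import org.apache.kafka.streams.kstream.{Consumed, Produced}
    import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

    object LogForwarder extends App {
      val props = new Properties()
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "analytical-ingestion")
      props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

      val builder = new StreamsBuilder()
      builder
        .stream("service-transaction-log", Consumed.`with`(Serdes.String(), Serdes.String()))
        .filter((_, value) => value != null) // drop tombstones; keep every asserted fact
        .to("analytical-raw-log", Produced.`with`(Serdes.String(), Serdes.String()))

      new KafkaStreams(builder.build(), props).start()
    }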

Data Democratization at Nubank

  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Data Democratization @Nubank, Andre Midea and Rodrigo Ney #UnifiedDataAnalytics #SparkAISummit
  3. Andre Midea, Engineer @ Nubank, /in/andremidea/, andremidea
  4. Rodrigo Ney, Engineer @ Nubank, /in/rodrigoney/, rodrigoney
  5. Styleguide, Illustration, Applications
  6. 2014: credit card supported by a fully digital and branchless experience. 2017: our own version of a bank account, the simplest and most intelligent solution yet.
  7. 2019: international expansion
  8. Growing quickly: a bank from scratch using Clojure
  9. Roles making decisions: 1. Business Analyst, 2. Financial Analyst, 3. Legal, 4. Data Scientist, 5. Customer Support
  10. We need to build a data platform
  11. The Report Lifecycle: where good ideas perish
  12. Have you ever faced/seen this situation? Nooooo! #UnifiedDataAnalytics #SparkAISummit
  13. Frustrations. Business users: engineers are too slow and don’t implement their ideas correctly, and the process is so painful that they tire of pushing new ideas. Engineers: business users underestimate the challenges of implementing something with a good level of quality.
  14. Goal: a data platform where the incentives are aligned so that people are empowered to make more informed decisions, achieved by reducing the friction for non-technical people to create creative solutions using data.
  15. /democracy/: the absence of hereditary or arbitrary class distinctions or privileges
  16. Movements in the software community: 1. Agile: customer collaboration over contract negotiation; relationships increase the sense of safety when we work together as partners. 2. DevOps: decentralizing the maintenance of services, so every team is responsible for keeping its services running, creating alerts, and having SLAs and SLOs. 3. DataOps: reduce heroism; as the pace and breadth of the need for analytic insights ever increases, analytic teams should strive to reduce heroism and create sustainable and scalable data analytic teams and processes.
  17. A principled approach to data engineering
  18. Our Stack
  19. Clojure: functional, JVM, LISP
  20. Datomic: accumulate-only/immutable, Git for your data, the transaction log as a high-level API
  21. Spark: lazy, declarative, FP-inspired
  22. In production: 330+ / 315 / 400 (engineers, micro-services, deploys per week); 80 / 13+ / 2,100 TB (shards, Datomic transactors, data). Resources: Architecting a Modern Financial Institution (InfoQ), Challenges and Benefits of an Immutable Database, Nubank Talk YouTube playlist.
  23. Principle 1: Having data coverage is important. I mean, ALL the data!
  24. Love your logs
  25. Datomic EAVT:
      • e: Entity (Long)
      • a: Attribute (String)
      • v: Value (Any)
      • t: Transaction point in time (Long)
      • tx: Transaction entity id (Long)
      • txInstant: Transaction wall-clock time (java.util.Date)
      • op: Operation, assertion/retraction (Boolean)
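
A hedged sketch of how that datom shape can be expressed as a Spark schema for the raw layer (field names mirror the table above; the exact ingestion schema is an assumption):

    import org.apache.spark.sql.types._

    // One datom per row: entity, attribute, value, transaction metadata and the assert/retract flag.
    val eavtSchema: StructType = StructType(Seq(
      StructField("e", LongType, nullable = false),                // entity id
      StructField("a", StringType, nullable = false),              // attribute
      StructField("v", StringType, nullable = true),               // value, serialized ("Any" in the slide)
      StructField("t", LongType, nullable = false),                // transaction point in time
      StructField("tx", LongType, nullable = false),               // transaction entity id
      StructField("tx_instant", TimestampType, nullable = false),  // transaction wall-clock time
      StructField("op", BooleanType, nullable = false)             // true = assertion, false = retraction
    ))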
  26. [ 28 ':name' 'john' ]
  27. entity / attribute / value: [ 28 ':name' 'john' ]
  28. [ 28 ':name' 'john' ] with the value updated to 'lennon' (entity / attribute / value)
  29. [ 28 ':name' 'john' Tx₁ true ], [ 28 ':name' 'john' Tx₂ false ], [ 28 ':name' 'lennon' Tx₂ true ] (entity / attribute / value / transaction / op)
  30. (let [log (datomic.api/log datomic)
            t1  10000
            t2  50000]
        (datomic.api/tx-range log t1 t2))
      => [Datom(:e 1234, :a :transaction/value, :v 10.00, :t 10000) ...]
  31. Infrastructure as code:
      {:name :diablo
       :canary {:type :shard}
       :datastores {:datomic {:databases [{:transactor "diablo"
                                           :name "diablo"}]}
                    :kafka {:enabled? true}}
       :environments {… :prod #nu/prototypes-for [:prod :sharded]}
       :pipelines #{{:type :clojure-service
                     :prod {:promotion :automatic}
                     :cdc-test-frameworks #{:sachem}}}}
  32. prod-schema → analytical-schema
  33. Principle 2: No need for an excess of features. Pave the right roads!
  34. “The portion of the Lambda Architecture that implements the batch view = function(all data) equation is called the batch layer. The batch layer stores the master copy of the dataset and precomputes batch views on that master dataset. The master dataset can be thought of as a very large list of records.” - Nathan Marz, “Big Data: Principles and best practices of scalable realtime data systems”
  35. val report: Map[String, DataFrame] => DataFrame
  36. Make datasets reusable by default: extend the same abstraction. (Diagram: existing datasets feed new ones, e.g. f(1,2) and f(3,4), across Datasets 1–5.)
  37. trait SparkOperation {
        val name: String
        val inputs: Set[String]
        def definition(inputs: Map[String, DataFrame]): DataFrame
      }
  38. object OmbudsmanCalls extends SparkOperation {
        override val name: String = "dataset/ombudsman-calls"
        override val inputs: Set[String] = Set(callsName)

        override def definition(datasets: Map[String, DataFrame]): DataFrame = {
          val calls = datasets(callsName)
          calls
            .where($"started_at".isNotNull and $"our_number" === phoneNumber)
            .select($"started_at", $"call__id", $"our_number")
        }

        def phoneNumber: String = "999-9999"
        def callsName: String = "contract-phone/calls"
      }
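
To illustrate “extend the same abstraction” from slide 36, here is a hypothetical downstream operation that reuses OmbudsmanCalls simply by listing its name as an input; the aggregation and the names are illustrative assumptions, not from the deck.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, count, to_date}

    object OmbudsmanCallsPerDay extends SparkOperation {
      override val name: String = "dataset/ombudsman-calls-per-day"
      override val inputs: Set[String] = Set(OmbudsmanCalls.name)

      // Reuse the upstream dataset exactly like a raw input: ask for it by name.
      override def definition(datasets: Map[String, DataFrame]): DataFrame =
        datasets(OmbudsmanCalls.name)
          .groupBy(to_date(col("started_at")).as("day"))
          .agg(count(col("call__id")).as("calls"))
    }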
  39. “Yeah, but nobody will use this… It’s hard! Spark and Scala…” We managed to have more than 300 business users working with our abstraction, more than 15 pull requests per day, and ~500k LOC (datasets only).
  40. Principle 3: KISS, Keep It Simple Stupid! (B for Batch)
  41. Donald Knuth: “… 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
  42. Dataset types and support: RAW (platform maintained), CONTRACT (platform maintained & user generated), DATASET (user defined); the table also marks which types users can consume directly.
  43. (Diagram: raw/log tables feed contracts, which feed user data in User Land.)
  44. (Diagram: contracts are cached; user data in User Land reads from the cache incrementally, filtering where t > t’.)
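
As a rough illustration of the incremental read in the diagram above (paths and column names are assumptions): the user-land job only pulls contract rows newer than the watermark t’ already held in the cache.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, max}

    val spark = SparkSession.builder.getOrCreate()

    // Hypothetical cache and contract locations; "t" is the transaction-time column.
    val cached   = spark.read.parquet("s3://warehouse/cache/contract-phone/calls")
    val contract = spark.read.parquet("s3://warehouse/contract/contract-phone/calls")

    val lastT   = cached.agg(max(col("t"))).head.getLong(0)   // the watermark t'
    val newRows = contract.where(col("t") > lastT)            // only read what changed since then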
  45. (Diagram: the raw/log layer is immutable/append-only; user datasets build on it daily. Introduce a change and a downstream user dataset breaks.)
  46. (Diagram: the next day, introduce a fix on top of the same immutable/append-only raw/log.)
  47. Principle 4: Safety nets. Keep your DAG valid at all times!
  48. Complex
  49. Raw layer: we know the schema of all raw datasets (eavt), so we can build a test DataFrame for every raw input.
  50. We can pipe that test DataFrame from one operation downstream in the lineage: raw inputs R’, R’’, R’’’, R’’’’ flow through f(R’), f(R’’), f(R’’’), f(R’’’’) into user-land datasets u’, u’’, u’’’, u’’’’. It’s lazy!
  51. And we keep going: f(U’, U’’) and f(U’’’, U’’’’) produce the second-level datasets 2u’ and 2u’’.
  52. Walking the whole DAG this way checks that all Spark transformations are valid!
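
A minimal sketch of the idea in slides 49–52 (the helper and the operation registry are assumptions): because DataFrame transformations are lazy, each operation’s definition can be applied to empty DataFrames carrying only the raw schemas, in topological order, and Spark’s analyzer will reject missing columns or type errors without ever running a job.

    import org.apache.spark.sql.{DataFrame, Row, SparkSession}
    import org.apache.spark.sql.types.StructType

    // Walk the operations in topological order and "run" each definition on schema-only
    // (empty) DataFrames. Nothing executes; Spark only resolves the logical plans, so an
    // unknown column or a type mismatch fails the test suite immediately.
    def validateDag(spark: SparkSession,
                    rawSchemas: Map[String, StructType],
                    operationsInTopoOrder: Seq[SparkOperation]): Map[String, DataFrame] = {
      val emptyRaw: Map[String, DataFrame] = rawSchemas.map { case (name, schema) =>
        name -> spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
      }
      operationsInTopoOrder.foldLeft(emptyRaw) { (available, op) =>
        val inputs = op.inputs.map(i => i -> available(i)).toMap
        available + (op.name -> op.definition(inputs))   // lazily builds the plan, never runs it
      }
    }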
  53. • Integration tests • Consumer-driven contracts • Integrity checks • Anomaly detection based on a dataset’s statistics over time
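
One way the last two items could look in practice (a sketch with assumed table names and thresholds, not Nubank’s actual checks): compare today’s row count for a dataset against its trailing average and flag large deviations.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{avg, col}

    val spark = SparkSession.builder.getOrCreate()

    // Hypothetical stats table with one row per (dataset, run_date) holding row_count.
    val stats = spark.read.parquet("s3://warehouse/meta/dataset-stats")
      .where(col("dataset") === "dataset/ombudsman-calls")

    val baseline = stats.where(col("run_date") < "2019-10-01")
      .agg(avg(col("row_count"))).head.getDouble(0)
    val today = stats.where(col("run_date") === "2019-10-01")
      .select(col("row_count")).head.getLong(0)

    // Flag the run when today's volume deviates more than 50% from the trailing average.
    val anomalous = math.abs(today - baseline) > 0.5 * baseline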
  54. Principle 5: Think, Iterate, Deploy… Easily!
  55. A REPL environment closer to production
  56. (Diagram: new service + db → logs → pivoted → contract → user dataset → data warehouse → BI tool, and back to production… automatically!)
  57. Results
  58. (Chart: Data Engineers vs Datasets, 2016 Q3 through 2019 Q4; the number of datasets grew from a handful to ~11,500.) That discrepancy is even more pronounced when we look at the number of datasets generated per data engineer.
  59. KEY TAKEAWAYS: 1. Think carefully about the roads you want to pave, and pave only a handful of them. 2. Constraints are a good thing; if they hide complexity away, users will be glad of them. 3. Leverage the transaction log. 4. Automate data ingestion to the extreme. 5. Have a dev environment closer to production. 6. Validate the DAG at test time, and create invariants for the datasets.
  60. We are hiring: nubank.com.br/en/careers/
  61. Don’t forget to rate and review the sessions. Search: Spark + AI Summit
