Lenddo - Data Driven NYC (27)

Lenddo's CEO and CTO, Jeff Stewart and Naveen Agnihotri, presented at May's edition of Data Driven NYC, which focused on p2p lending.

  1. Empowering the Emerging Market Middle Class
     Big Data is not Big Database
     Jeff Stewart - CEO
     Naveen Agnihotri, PhD - CTO
  2. “If you look 5 years out, every industry is going to be rethought in a social way.” - Mark Zuckerberg, 2010
  3. LENDDO TECH FACTS
     ● Founded in January 2011
     ● Over 500K members around the world
     ● Integrated with Facebook, LinkedIn, Google, Yahoo, Twitter
     ● Service-oriented architecture (LAMP)
       ○ Front end (clients) in PHP
       ○ Services in PHP and Python
     ● Technical team based in NY and PH
     ● Data science team based in NY
  4. Finance in the Age of Social Networks
     Lenddo maintains the world's largest opt-in TrustGraph for trustworthiness and risk management.
     Lenddo is… Social Social sourcing & screening Peer enforcement New data sets Algorithms Unprecedented processing power Real-time / ongoing risk management Targeting, underwriting & collections Cloud Rich risk analytic data set Unprecedented processing power Global Mobile New datasets 24/7 engagement new cost structures
  5. Why Finance Works Better with Lenddo
     DEMAND GENERATION / UNDERWRITING / COLLECTIONS
     Traditional
     • Negative selection bias
     • Costly
     • Fact verification time-consuming
     • Scores incomplete or unavailable
     • No peer enforcement
     • Labor intensive
     • Hard to maintain contact
     With Lenddo
     • Digital, fast and potentially viral
     • Less expensive
     • Social nature causes positive selection bias
     • Reduced fraud and default
     • Big data and powerful algorithms
     • Larger addressable market
     • Easily automatable
     • Potential for peer enforcement
     • Lower cost
     • More points of contact
  6. Source: ID Verification is easier online
  7. LENDDO SOCIAL DATA
     ● 100% infrastructure on AWS
     ● Store social data from all online social networks
     ● Opt-in social data storage grows about 10 times faster than member data
     ● Social data currently about 3.5 TB
     ● Largest table (dataset) is > 2 billion records
  8. GOOD AND BAD BORROWERS (n=1347)
  9. CLUSTERS
  10. LOAN SCORE IMPROVEMENT: No NLP or network
  11. LOAN SCORE IMPROVEMENT: No NLP or network; With NLP and network
  12. WORD CLUSTERS: Words associate closely together, and can be commonly associated with good or bad loans.
  13. WORDS AND LOAN QUALITY: % Association with BAD loans; % Association with GOOD loans
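The slides do not say how the word percentages were computed. As a minimal sketch, assuming each loan carries a good/bad label and a set of words from the borrower's opt-in social text, one simple measure is the share of loans containing a word that turned out good versus bad. The function name `word_loan_association` and the toy records are hypothetical and are not Lenddo data.

```python
from collections import Counter


def word_loan_association(loans):
    """Return {word: (pct_good, pct_bad)} from (label, words) pairs.

    label is "good" or "bad"; words is an iterable of tokens for that loan.
    Each word is counted at most once per loan.
    """
    good, bad = Counter(), Counter()
    for label, words in loans:
        target = good if label == "good" else bad
        target.update(set(words))
    result = {}
    for word in set(good) | set(bad):
        total = good[word] + bad[word]
        result[word] = (100.0 * good[word] / total, 100.0 * bad[word] / total)
    return result


# Toy, made-up records with neutral placeholder tokens (illustration only).
toy_loans = [
    ("good", ["alpha", "beta"]),
    ("good", ["beta", "gamma"]),
    ("bad", ["delta", "beta"]),
    ("bad", ["delta"]),
]
assoc = word_loan_association(toy_loans)
```

Here `assoc["beta"]` pairs the good-loan share with the bad-loan share for that token; a real analysis would also need minimum-frequency cut-offs so that rare words do not dominate.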
  14. SOCIAL DATA STORAGE HISTORY
      ● Started with MongoDB for social data storage
      ● As use cases grew, we added indexes
  15. SOCIAL DATA STORAGE: User data, Social data
  16. SOCIAL DATA STORAGE: Social data, User data
  17. SOCIAL DATA STORAGE HISTORY
      ● We moved to larger and larger servers
        ○ At last iteration, used cr1.8xlarge server
        ○ 32 CPUs, 244 GB RAM
        ○ Still couldn’t keep up with index size
      ● Data acquisition speeds increased
        ○ Provisioned IOPS to the rescue!
      ● Total cost of social data storage: > $10,000 per month
      ● And we want to grow faster!
  18. SOCIAL DATA STORAGE HISTORY
      ● Simple queries (by index)
      ● Complex queries (by multiple indexes)
      ● Pull out all data for a member
      ● Aggregate all data for a member
      ● Calculate score for a member
      ● Aggregate all data for all members
      ● Calculate score for all members
  19. REVELATION: 2013
      ?
  20. REVELATION: 2013
      It’s “BIG DATA” not “BIG DATABASE”
  21. SOCIAL DATA REVAMP - 2013
      ● Moved all data to Amazon S3
      ● Data model remains largely unchanged
      ● Hadoop-compatible storage format
        ○ Avro format
        ○ Snappy compressed, chunked
      ● Created a small ‘cache’ type MongoDB
        ○ stores recent data temporarily
      ● Using DynamoDB for longer-lived data that needs to be queried all the time
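As a rough illustration of "compressed, chunked" storage: records are grouped into blocks and each block is compressed on its own, so a reader can process one block at a time. This sketch uses JSON lines and `zlib` from the Python standard library as stand-ins for Avro and Snappy. It is not the Avro container format, and the function names, record fields, and block size are assumptions.

```python
import json
import zlib


def write_chunks(records, chunk_size=2):
    """Group records into blocks of chunk_size and compress each block separately."""
    chunks = []
    for start in range(0, len(records), chunk_size):
        block = records[start:start + chunk_size]
        payload = "\n".join(json.dumps(r, sort_keys=True) for r in block)
        chunks.append(zlib.compress(payload.encode("utf-8")))
    return chunks


def read_chunks(chunks):
    """Yield records one block at a time, never decompressing the whole data set."""
    for blob in chunks:
        for line in zlib.decompress(blob).decode("utf-8").split("\n"):
            yield json.loads(line)


# Hypothetical records; field names are illustrative only.
records = [{"member_id": i, "network": "example"} for i in range(5)]
chunks = write_chunks(records)
restored = list(read_chunks(chunks))
```

Compressing per block rather than per file is what lets a Hadoop-style job hand different blocks of the same file to different workers.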
  22. NEW SOCIAL DATA USAGE
      ● Use the cache for data when it first arrives
        ○ Data is available for quick computations and
      ● Move data from cache to S3 at the end of the day
      ● Use EMR over S3 data for all aggregations
      ● Created an EMR-based map-reduce framework for the data science team
      ● Standard EMR jobs for common queries:
        ○ All social data for a member
        ○ Score one member
        ○ Score all members
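The flow on this slide can be sketched in plain Python: new records land in a cache, an end-of-day step moves them to the archive, and aggregations run as map and reduce steps over the archive. A list and a dictionary stand in for MongoDB and S3, and ordinary functions stand in for EMR jobs. The record fields and the placeholder `score` function are hypothetical and do not reflect Lenddo's scoring.

```python
from collections import defaultdict

cache = []                    # stand-in for the small MongoDB cache
archive = defaultdict(list)   # stand-in for S3, keyed by day


def ingest(record):
    """New data goes to the cache first, where it is quickly available."""
    cache.append(record)


def flush_to_archive(day):
    """End-of-day step: move everything in the cache into the archive."""
    archive[day].extend(cache)
    cache.clear()


def map_by_member(records):
    """Map step: emit (member_id, record) pairs."""
    for record in records:
        yield record["member_id"], record


def reduce_by_member(pairs):
    """Reduce step: collect all records for each member."""
    grouped = defaultdict(list)
    for member_id, record in pairs:
        grouped[member_id].append(record)
    return dict(grouped)


def score(records):
    """Placeholder score (record count); the real scoring is not described."""
    return len(records)


def score_all_members():
    """'Score all members' job: aggregate the whole archive, then score each group."""
    all_records = (r for day in archive.values() for r in day)
    grouped = reduce_by_member(map_by_member(all_records))
    return {member: score(rs) for member, rs in grouped.items()}


ingest({"member_id": "m1", "network": "example"})
ingest({"member_id": "m2", "network": "example"})
flush_to_archive("day-1")
ingest({"member_id": "m1", "network": "example"})
flush_to_archive("day-2")
scores = score_all_members()
```

The "all social data for a member" and "score one member" jobs would be the same map and reduce steps filtered to a single `member_id`.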
  23. WHAT DID WE GAIN?
      ● Peace of mind
        ○ No more database maintenance
        ○ No more periodic server upgrades
      ● Scalability
        ○ Storage and access remain identical for the next 10x growth
      ● $$$
        ○ New cost: < $3000 per month: 70% less!
        ○ Includes EMR clusters running routine jobs
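The 70% figure follows from the two monthly costs quoted in the slides. Because the old cost is given as a lower bound (> $10,000) and the new cost as an upper bound (< $3,000), 70% is the minimum saving those bounds imply. The variable and function names below are illustrative.

```python
old_cost_floor = 10_000    # slides: old storage cost was more than this per month (USD)
new_cost_ceiling = 3_000   # slides: new cost is less than this per month (USD)


def percent_saving(old, new):
    """Percentage reduction going from old to new."""
    return 100.0 * (old - new) / old


# Any old cost above the floor, or new cost below the ceiling, only raises this.
min_saving = percent_saving(old_cost_floor, new_cost_ceiling)
```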
  24. Thanks!