Data Science Stack with MongoDB and RStudio

Building up an easy data science platform with RStudio server on top of your MongoDB

Winston Chen – Lead Software Engineer

Published in: Engineering, Technology, Education

Transcript

  • 1. Data Science Stack with MongoDB and RStudio: Building up an easy data science platform with RStudio Server on top of your MongoDB. Winston Chen – Lead Software Engineer
  • 2. What does Fliptop do?
      • Predictive lead scoring, using data science
          – Pull opportunity/lead/contact data from CRM
          – Aggregate company data and social data from various data sources and the internet
          – Over 3000 signals
          – Build conversion/revenue model
          – Predict lead conversion and revenue
  • 3. Our Platform Stack
      • Java/Scala
      • Liftweb
      • JMS/Storm
      • MongoDB/MySQL
  • 4. Our Machine Learning Stack
      • Python
      • NumPy/SciPy/Pandas
      • Bottle (RESTful server)
  • 5. So, where is R then?
      • Problem:
          – Data is stored in MongoDB
              • Sales lead data
              • Sales opportunity data
              • Sales contact data
          – It's hard to view/digest/process data on the fly using the MongoDB console
              • (X) Text processing for insight extraction?
              • (X) Prototype cool machine learning algorithms on the fly?
      • Solution:
          – R and RStudio Server
              • Why not Scala?
              • Why not Python/IPython?
  • 6. MongoDB Console & Query
  • 7. RStudio Server
  • 8. Pull MongoDB data into an R data frame
      • rmongodb (https://github.com/gerald-lindsly/rmongodb)
      • Transform into an R data frame
  • 9. 1 – Get the total count of your data set
  • 10. 2 – Construct vectors for each column
  • 11. 3 – Loop through the cursor and insert values. Where are my apply functions? Too bad. We are using a mongo cursor :P
  • 12. 4 – Go into the sub-BSON block to extract data (optional)
  • 13. 5 – Construct the data frame and return it
      • The full example code is available here: http://goo.gl/tlyyXp
      • We now have a data frame to play with, built from MongoDB BSON.
  • 14. This is NOT a BIG DATA Stack
      • It takes around 1 min to process 900 MB+ of BSON from Mongo.
      • NOT a BIG data stack
          – Data should fit into RAM
      • Most of the data in the business world is not big anyway.
      • It works fine for us (m1.large machine in AWS)
          – CRM data is never big, not even after we pull in 3000+ additional signals.
          – The term 'Big Data' is seriously overrated; 'Data Science', however, is the key term here.
  • 15. At Fliptop, we now use RStudio for
      • Data insight extraction
      • Algorithm prototyping
  • 16. If you REALLY want BIG Data
      • Look into: HDFS + Pig/Hive + Hue (any other suggestions from the audience here?)
  • 17. QA
      • Winston Chen
          – Personal blog: http://winston.attlin.com/
          – Twitter: @wingchen83
          – winston@fliptop.com
      • Fliptop is hiring Data Scientists. Please email: winston@fliptop.com
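Slides 9 to 13 describe a five-step pattern for turning a MongoDB cursor into a column-oriented data frame, but the code shown on those slides is not part of this transcript. The following is a minimal sketch of the same pattern in Python (part of the machine learning stack on slide 4), not the author's R code. It assumes an in-memory list of documents standing in for a real cursor; the sample data, `get_nested`, and `cursor_to_columns` are hypothetical names introduced only for illustration, and no rmongodb or MongoDB driver calls are shown.

```python
from typing import Any, Dict, Iterable, List, Optional

# Hypothetical sample documents standing in for a MongoDB collection.
SAMPLE_LEADS: List[Dict[str, Any]] = [
    {"email": "a@example.com", "score": 0.9, "company": {"name": "Acme", "size": 120}},
    {"email": "b@example.com", "score": 0.4, "company": {"name": "Globex"}},
    {"email": "c@example.com", "company": {"name": "Initech", "size": 45}},
]


def get_nested(doc: Dict[str, Any], path: str) -> Optional[Any]:
    """Step 4: walk into sub-documents along a dotted path; None if absent."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def cursor_to_columns(
    cursor: Iterable[Dict[str, Any]], total: int, fields: List[str]
) -> Dict[str, List[Any]]:
    """Build one column per field from a cursor-like iterable."""
    # Step 2: preallocate one vector per column, sized by the count.
    columns: Dict[str, List[Any]] = {field: [None] * total for field in fields}
    # Step 3: loop through the cursor and fill values row by row.
    for row, doc in enumerate(cursor):
        if row >= total:
            break  # more documents arrived after the count was taken
        for field in fields:
            columns[field][row] = get_nested(doc, field)
    # Step 5: the dict of equal-length columns plays the role of the data frame.
    return columns


if __name__ == "__main__":
    total = len(SAMPLE_LEADS)  # Step 1: total count of the data set
    frame = cursor_to_columns(
        iter(SAMPLE_LEADS), total, ["email", "score", "company.size"]
    )
    for name, values in frame.items():
        print(name, values)
```

Counting first and preallocating the columns avoids growing vectors inside the loop, which is the reason the slides take the count before iterating. Missing fields are left as `None` so every column keeps the same length.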
