DevTalks Cluj - Open-Source Technologies for Analyzing Text

There are great open-source technologies for NLP (NLTK), machine learning (gensim, scikit-learn) and distributed computation (Spark). So don't shy away from big ideas, and make use of these amazing technologies at your fingertips!

Published in: Technology

  1. 1. An open-source tech stack to analyze all reviews on the Internet. Steffen Wenz, CTO, TrustYou, steffen@trustyou.net
  2. 2. ✓ Very good hotel!* ✓ Near city centre “Close to the city center” ✓ Clean rooms « Chambre impeccable » ✓ Popular with solo travelers “Remote doesnt work” *) Ramada Cluj (Full summary)
  3. 3. TrustYou Architecture (diagram labels): Crawling, Semantic Analysis, DB, API, TrustYou Analytics; Google, Hotels.com …; 200 million reqs/month; ❤ Python
  4. 4. Scrapy ● Build your own web crawlers ● Extract data via CSS selectors, XPath, regexes … ● Handles “tag soup”, queuing, request parallelism, cookies, throttling … ● Code sample on GitHub
  5. 5. NLP in Python ● NLTK ○ Word/sentence tokenization ○ POS tagging, parsing ● Great support for scientific computation: NumPy, SciPy, Pandas ● Scikit-learn ● TensorFlow!
  6. 6. Gensim: Fun with Word2Vec
       >>> # trained from 100k meetup descriptions!
       >>> m = gensim.models.Word2Vec.load("data/word2vec")
       >>> m.most_similar(positive=["python"])[:3]
       [(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django', 0.8189617991447449)]
       >>> m.doesnt_match(["python", "c++", "javascript"])
       'c++'
       >>> m.most_similar(positive=["berlin"])[:3]
       [(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland', 0.7970746755599976)]
       >>> m.most_similar(positive=["ladies"])[:3]
       [(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
  7. 7. Big Data & Open Source 2004 MapReduce, GFS BigTable, Spanner, F1 … Apache Beam …
  8. 8. Spark ● User writes driver program which transparently schedules execution in a cluster ● Faster and more expressive than MapReduce ● Spark SQL: Interactive query of large datasets ● Spark Streaming: Spark is “batch first”, but fast enough to implement stream processing with “mini batches” ● Spark MLlib: Machine learning
  9. 9. Luigi ● Build complex pipelines of batch jobs ○ Dependency resolution ○ Parallelism ○ Resume failed jobs ● Some support for Hadoop ● Pythonic replacement for Oozie
  10. 10. Try it out! GitHub repo showcasing: ● Luigi ● Scrapy ● Word2Vec model training with gensim @ https://github.com/trustyou/meetups
  11. 11. steffen@trustyou.com
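
The Word2Vec queries on slide 6 rank words by the cosine similarity of their vectors. The sketch below shows that ranking step in plain Python, without gensim. The vectors in `TOY_VECTORS` are invented for illustration and are not trained embeddings, and the helper names `cosine` and `most_similar` are hypothetical stand-ins, not gensim's implementation.

```python
import math

# Toy 3-dimensional vectors, made up for illustration only.
# A real Word2Vec model learns vectors with far more dimensions from text.
TOY_VECTORS = {
    "python": [0.9, 0.1, 0.0],
    "javascript": [0.8, 0.2, 0.1],
    "berlin": [0.0, 0.9, 0.2],
    "paris": [0.1, 0.8, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Return up to `topn` (word, similarity) pairs, most similar first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # never return the query word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]
```

With these toy vectors, `most_similar("python", TOY_VECTORS)` ranks "javascript" first and `most_similar("berlin", TOY_VECTORS)` ranks "paris" first, which mirrors the shape of the results on the slide.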

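
Slide 9 lists dependency resolution and resuming failed jobs as Luigi's core features. The following is a minimal standard-library sketch of those two ideas, not Luigi itself. The method names `requires`, `complete` and `run` loosely mirror Luigi's task interface, but the `Task` class, its constructor arguments and the `build` function here are hypothetical, and completion is tracked in memory rather than through output files.

```python
class Task:
    """Hypothetical minimal task; an illustration, not Luigi's Task class."""

    def __init__(self, name, deps=(), action=None):
        self.name = name
        self.deps = list(deps)
        self.action = action  # optional callable doing the actual work
        self.done = False

    def requires(self):
        return self.deps

    def complete(self):
        return self.done

    def run(self):
        if self.action is not None:
            self.action()  # if this raises, the task stays incomplete
        self.done = True


def build(task, log=None, _visiting=None):
    """Run `task` after its dependencies, skipping tasks already complete.

    Returns the names of the tasks executed by this call, in order.
    """
    log = [] if log is None else log
    visiting = set() if _visiting is None else _visiting
    if task.complete():
        return log
    if task.name in visiting:
        raise ValueError("dependency cycle at " + task.name)
    visiting.add(task.name)
    for dep in task.requires():
        build(dep, log, visiting)
    visiting.discard(task.name)
    task.run()
    log.append(task.name)
    return log
```

Calling `build` on the last task of a pipeline runs every upstream task first. Calling it again after a failure re-runs only the tasks that did not complete, which is the "resume" behavior in miniature.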