
Big Data Analysis in Python @ 2013



Big data analysis involves several processes: collecting, storage, computing, analysis, and visualization. In these slides, the author demonstrates those processes by using Python tools to build a data product. The example is based on text analysis of an online forum.

Published in: Technology


  1. Jimmy Lai (r97922028 [at]), slides:
  2. Outline
     1. Overview
     2. Big Data Analysis Example
     3. Scrapy
     4. Mongodb
     5. Solr
     6. Scikit-learn
  3. When Big Data Meets Python (process diagram, stages over a common infrastructure)
     • Collecting (user-generated content, machine-generated data): Scrapy (scraping framework)
     • Storage: PyMongo (Python client for Mongodb)
     • Computing: Hadoop streaming (Linux pipe interface), Disco (lightweight MapReduce in Python)
     • Analysis: Pandas (data analysis/manipulation), Statsmodels (statistics), NLTK (natural language processing), Scikit-learn (machine learning), Solr (full-text search by REST API)
     • Visualization: Matplotlib (plotting), NetworkX (graph visualization)
  4. Why Python?
     • Good code readability for fast development.
     • Scripting language: the less code, the more productivity.
     • Applications: Web, GUI, OS, Science.
     • Fast growing among open source communities (commit statistics from ohloh.net).
  5. Why Python?
     • The support of open source projects.
  6. Big Data Analysis Example
  7. Big Data Analysis Example
  8. Scrapy: web scraping framework
     • pip install scrapy
     • Parse fields by XPath
     • Spider: Request, Response, parse function, Pipeline
     • For Big Data: parallel crawling of web pages
  9. Web scraping framework: parse articles, parse comments.
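Scrapy selects fields with XPath expressions. A rough stdlib sketch of the same idea (the sample markup, class name, and link texts are made up; real Scrapy uses HtmlXPathSelector on raw HTML rather than ElementTree):

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed listing page; real forum HTML would need an
# HTML-tolerant parser such as the one built into Scrapy.
page = """<html><body><table>
  <tr><td class="title"><a href="/article/1">First post</a></td></tr>
  <tr><td class="title"><a href="/article/2">Second post</a></td></tr>
</table></body></html>"""

root = ET.fromstring(page)
# ElementTree supports this XPath subset: descendant <td class="title">, then <a>.
links = [(a.get("href"), a.text) for a in root.findall('.//td[@class="title"]/a')]
# links == [('/article/1', 'First post'), ('/article/2', 'Second post')]
```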
  10. Scrapy spider example (traverse web pages, extract fields by XPath):

      from scrapy.spider import BaseSpider
      from scrapy.http import Request
      from scrapy.selector import HtmlXPathSelector

      class GroupSpider(BaseSpider):
          name = "group"
          allowed_domains = [""]   # domain elided in the slide
          start_urls = ["" % i for i in range(10000, 20000, 25)]  # URL template elided in the slide

          def __init__(self):
              self.coll = get_coll()

          def parse(self, response):
              hxs = HtmlXPathSelector(response)
              for tr in'//tr'):  # row XPath elided in the slide
                  title, url, author, author_url = None, None, None, None
                  result ='.//*[@class="title"]')
                  if len(result) == 1:
                      url = result[0].extract()
                      if self.coll.find_one({'url': url}) is not None:
                          self.log('duplicated url: %s' % url)
                      else:
                          yield Request(url, callback=self.parse_article)
  11. Mongodb: document-based NoSQL database
      • Python client: pip install pymongo
      • JSON-style documents
      • For Big Data: Mongodb cluster with sharding and replicas
      • article document fields: author, author_url, title, content, comments, url

      pymongo example:

      import pymongo

      conn = pymongo.Connection()
      coll = conn.database.collection
      coll.ensure_index('url', unique=True)
      article = {'author': u'小明', 'author_url': 'http://host/path1',
                 'comments': [], 'title': u'天天向上', 'content': '',
                 'url': 'http://host/path2'}
      coll.insert(article)
      results = [i for i in coll.find({'title': u'天天向上'})]
  12. Solr: full-text search engine
      • Customize full-text indexing and searching by XML config
      • RESTful API for select/update of the index
      • Spatial search
      • For Big Data: SolrCloud, distributed indexes
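Solr's select handler is plain HTTP, so a query is just a URL. A minimal sketch of building one with the stdlib (the host, port, and core name are placeholders, not from the slides):

```python
from urllib.parse import urlencode

def solr_select_url(base, query, rows=10):
    """Build the URL for Solr's RESTful /select search handler."""
    params = {"q": query, "wt": "json", "rows": rows}
    return "%s/select?%s" % (base, urlencode(params))

url = solr_select_url("http://localhost:8983/solr/articles", "title:pandas")
# url == 'http://localhost:8983/solr/articles/select?q=title%3Apandas&wt=json&rows=10'
```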
  13. schema.xml: define the fields and types, and the indexing process:

      <field name="url" type="text_general" indexed="true" stored="true"/>
      <field name="title" type="text_general" indexed="true" stored="true"/>
      <field name="content" type="text_general" indexed="true" stored="true"/>
      <field name="comments" type="text_general" indexed="true" stored="true" multiValued="true"/>
      <field name="author" type="text_general" indexed="true" stored="true"/>
      <field name="author_url" type="text_general" indexed="true" stored="true"/>

      <copyField source="title" dest="text"/>
      <copyField source="author" dest="text"/>
      <copyField source="content" dest="text"/>
      <copyField source="comments" dest="text"/>

      <fieldType name="text_smart_chinese" class="solr.TextField" positionIncrementGap="100">
        <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
        <filter class="solr.SmartChineseWordTokenFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </fieldType>

      <uniqueKey>url</uniqueKey>
  14. (image-only slide)
  15. Popular Book Ranking List (search for the "《[^》]+》" pattern)

      By regex search in Mongodb | By full-text search in Solr
      少有人走的路               | 那些年我们一起追的女孩
      百年孤独                   | 了不起的盖茨比
      遇见未知的自己             | 我不要你死于一事无成
      平凡的世界                 | 考拉小巫的英语学习日记
      送你一颗子弹               | 高效能人士的7个习惯
      小王子                     | 哪来的天才
      拖延心理学                 | 被嫌弃的松子的一生
      活着                       | 与众不同的心理学
      苏菲的世界                 | 奇特的一生
      红楼梦                     | 普罗旺斯的一年
      目送                       | 接纳不完美的自己
      如何阅读一本书             | 我就是想停下来,看看这个世界
      围城                       | 这些都是你给我的爱
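The left column comes from running the slide's "《[^》]+》" pattern over article text stored in Mongodb. The same pattern in Python, with a capturing group added to strip the brackets (the sample sentence is made up):

```python
import re

# Book titles in Chinese text are conventionally quoted in 《…》 brackets.
text = u"我最近在读《活着》和《小王子》，都很不错。"
titles = re.findall(u"《([^》]+)》", text)
# titles == ['活着', '小王子']
```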
  16. Machine Learning
      • Machine learning algorithms: supervised learning, unsupervised learning
      • Processing: model selection, pipeline
      • Application: article recommendation. Let the computer figure out your interests, then recommend articles for you. Learning by OneClassSVM.
  17. Article Recommendation
      • Steps: pick some favorite articles; extract features from the articles and let the model learn from them; the model predicts on unseen articles and recommends them to you.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import OneClassSVM

      articles = [coll.find_one({'url': url})['content'] for url in urls]
      tfidf = TfidfVectorizer(tokenizer=my_tokenizer, ngram_range=(1, 3))
      train_vectors = tfidf.fit_transform(articles)
      svm = OneClassSVM(kernel="linear", nu=0.3)

      • Result: recommended 390 out of 8000 articles, and the recommended articles were good to me.
  18. Pandas: numerical data analysis
      • Data structures: Series, DataFrame
      • Operations: slicing, indexing, subsetting

      from pandas import Series
      import matplotlib.pyplot as plt

      sa = Series(index=range(len(commenters)), data=commenters)
      fig, axs = plt.subplots(1, 3)
      sa.plot(ax=axs[0], title='number of commenters', style='r.')
  19. Reference (1/2)
      • Books:
        – Wes McKinney, "Python for Data Analysis", O'Reilly, 2012
        – Toby Segaran, "Programming Collective Intelligence: Building Smart Web 2.0 Applications", O'Reilly, 2008
        – Philipp K. Janert, "Data Analysis with Open Source Tools", O'Reilly, 2010
      • Coursera:
        – Web Intelligence and Big Data, url
        – Introduction to Data Science, url
        – Data Analysis, url
  20. Reference (2/2)
      • Conferences:
        – PyData 2013
        – PyData 2012
      • My former shares:
        – "When Big Data Meet Python", COSCUP 2012, slides
        – "NLTK: natural language toolkit overview and application", 2012, slides