Big data analysis in python @ 2013


Big data analysis involves several processes: collecting, storing, computing, analyzing, and visualizing data. In these slides, the author demonstrates each of these processes by using Python tools to build a data product. The example is based on text analysis of an online forum.


  1. Jimmy Lai (r97922028 [at]), slides:
  2. Outline
     1. Overview
     2. Big Data Analysis Example
     3. Scrapy
     4. Mongodb
     5. Solr
     6. Scikit-learn
  3. When Big Data meets Python: the pipeline runs from collecting (user generated content, machine generated data) through storage, computing, and analysis to visualization, with Python tools at each stage:
     • Collecting: Scrapy (scraping framework)
     • Storage: PyMongo (Python client for MongoDB)
     • Computing: Hadoop streaming (Linux pipe interface), Disco (lightweight MapReduce in Python)
     • Analysis: Pandas (data analysis/manipulation), Statsmodels (statistics), NLTK (natural language processing), Scikit-learn (machine learning), Solr (full-text search by REST API)
     • Visualization: Matplotlib (plotting), NetworkX (graph visualization)
  4. Why Python?
     • Good code readability for fast development.
     • Scripting language: the less code, the more productivity.
     • Applications: web, GUI, OS, science.
     • Fast growing among open source communities (commit statistics from ohloh.net).
  5. Why Python? The support of open source projects.
  6. Big Data Analysis Example (figure)
  7. Big Data Analysis Example (figure)
  8. Scrapy: web scraping framework
     • pip install scrapy
     • Parse fields by XPath
     • Spider: Request, Response, parse function, Pipeline
     • For Big Data: parallel crawling of web pages
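Scrapy's own selectors do the XPath field parsing inside a spider; as a dependency-free sketch of the same idea, the standard library's ElementTree can extract fields from well-formed markup. The page snippet, class name, and URLs below are made up for illustration:

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a scraped forum listing page (hypothetical markup).
page = """
<html><body>
  <table>
    <tr><td class="title"><a href="http://host/article/1">First post</a></td></tr>
    <tr><td class="title"><a href="http://host/article/2">Second post</a></td></tr>
  </table>
</body></html>
"""

def parse_titles(html):
    """Extract (title, url) pairs the way a spider's parse() would, via an XPath-style query."""
    root = ET.fromstring(html)
    results = []
    for a in root.findall('.//td[@class="title"]/a'):
        results.append((a.text, a.get("href")))
    return results

print(parse_titles(page))
```

A real Scrapy spider would do the same lookup with `response` selectors and yield follow-up `Request` objects for each extracted URL.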
  9. Web scraping framework: parse the article page, then parse its comments (figure).
  10. Traverse web pages and extract fields by XPath:

      ```python
      class GroupSpider(BaseSpider):
          name = "group"
          allowed_domains = [""]
          start_urls = ["" % i for i in range(10000, 20000, 25)]

          def __init__(self):
              self.coll = get_coll()

          def parse(self, response):
              hxs = HtmlXPathSelector(response)
              for tr in hxs.select(...):  # row XPath lost in extraction
                  title, url, author, author_url = None, None, None, None
                  result = tr.select('...[@class="title"]')  # field XPath partly lost in extraction
                  if len(result) == 1:
                      url = result[0]
                      if self.coll.find_one({"url": url}) is not None:
                          self.log("duplicated url: %s" % url)
                      else:
                          yield Request(url, callback=self.parse_article)
      ```
  11. MongoDB: document based NoSQL database
      • Python client: pip install pymongo
      • JSON style documents
      • For Big Data: MongoDB cluster with sharding and replicas
      • article fields: author, author_url, title, content, comments, url

      ```python
      import pymongo

      conn = pymongo.Connection()
      coll = conn.database.collection
      coll.ensure_index("url", unique=True)
      article = {"author": u"小明", "author_url": "http://host/path1",
                 "comments": [], "title": u"天天向上", "content": "",
                 "url": "http://host/path2"}
      coll.insert(article)
      results = [i for i in coll.find({"title": u"天天向上"})]
      ```
  12. Solr: full-text search engine
      • Customize full-text indexing and searching by XML config
      • RESTful API for select/update of the index
      • Spatial search
      • For Big Data: SolrCloud, distributed indexes
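The RESTful select API mentioned above is plain HTTP, so it can be driven from Python without any Solr client library. A minimal sketch of building such a query URL, assuming a hypothetical local Solr core named `articles` (fetching the URL with `urllib.request.urlopen` would return JSON search results):

```python
from urllib.parse import urlencode

def solr_select_url(base, query, rows=10):
    """Build a Solr /select URL for the RESTful search API.

    `base` is the core's base URL; `rows` limits the result count.
    """
    params = {"q": query, "wt": "json", "rows": rows}
    return "%s/select?%s" % (base, urlencode(params))

# Hypothetical local core; the query searches the `title` field.
url = solr_select_url("http://localhost:8983/solr/articles", u"title:天天向上", rows=5)
print(url)
```

Note that `urlencode` percent-encodes the Chinese query terms, so the URL is safe to send as-is.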
  13. schema.xml: define fields, types, and the index process.

      ```xml
      <field name="url" type="text_general" indexed="true" stored="true"/>
      <field name="title" type="text_general" indexed="true" stored="true"/>
      <field name="content" type="text_general" indexed="true" stored="true"/>
      <field name="comments" type="text_general" indexed="true" stored="true" multiValued="true"/>
      <field name="author" type="text_general" indexed="true" stored="true"/>
      <field name="author_url" type="text_general" indexed="true" stored="true"/>

      <copyField source="title" dest="text"/>
      <copyField source="author" dest="text"/>
      <copyField source="content" dest="text"/>
      <copyField source="comments" dest="text"/>

      <fieldType name="text_smart_chinese" class="solr.TextField" positionIncrementGap="100">
        <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
        <filter class="solr.SmartChineseWordTokenFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </fieldType>

      <uniqueKey>url</uniqueKey>
      ```
  14. (figure)
  15. Popular Book Ranking List, built by searching for the 《[^》]+》 pattern:

      | By regex search in Mongodb | By full-text search in Solr |
      |----------------------------|-----------------------------|
      | 少有人走的路 | 那些年我们一起追的女孩 |
      | 百年孤独 | 了不起的盖茨比 |
      | 遇见未知的自己 | 我不要你死于一事无成 |
      | 平凡的世界 | 考拉小巫的英语学习日记 |
      | 送你一颗子弹 | 高效能人士的7个习惯 |
      | 小王子 | 哪来的天才 |
      | 拖延心理学 | 被嫌弃的松子的一生 |
      | 活着 | 与众不同的心理学 |
      | 苏菲的世界 | 奇特的一生 |
      | 红楼梦 | 普罗旺斯的一年 |
      | 目送 | 接纳不完美的自己 |
      | 如何阅读一本书 | 我就是想停下来，看看这个世界 |
      | 围城 | 这些都是你给我的爱 |
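The 《…》 pattern search that produced these rankings can be sketched directly in Python; book titles in Chinese text are conventionally wrapped in 《》 brackets, so the regex captures them cleanly. The sample sentence below is made up for illustration:

```python
import re

# Match anything wrapped in Chinese book-title brackets.
BOOK_TITLE = re.compile(u"《[^》]+》")

sample = u"最近在读《活着》和《小王子》，都很不错。"
titles = BOOK_TITLE.findall(sample)
names = [t.strip(u"《》") for t in titles]  # drop the brackets
print(names)  # ['活着', '小王子']
```

Running this pattern over every article's content and counting matches yields a ranking like the left column above.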
  16. Machine Learning
      • Algorithms: supervised learning, unsupervised learning
      • Processing: model selection, pipeline
      • Application, article recommendation: let the computer figure out your interests, and then recommend articles for you; learning by OneClassSVM
  17. Article Recommendation: pick some favorite articles; extract features from the articles and let the model learn from the features; the model then predicts on unseen articles and recommends them for you.

      ```python
      articles = [coll.find_one({"url": url})["content"] for url in urls]
      tfidf = TfidfVectorizer(tokenizer=my_tokenizer, ngram_range=(1, 3))
      train_vectors = tfidf.fit_transform(articles)
      svm = OneClassSVM(kernel="linear", nu=0.3)
      svm.fit(train_vectors)  # training step implied by the slide's diagram
      ```

      Result: 390 of 8,000 articles were recommended, and the recommended articles were good to the author.
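As a rough idea of what the TfidfVectorizer step computes, here is a minimal tf-idf sketch in plain Python. Unlike scikit-learn's default it applies no smoothing or length normalization, and the documents are made up:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Minimal tf-idf: term frequency times log(N / document frequency).

    A sketch of the feature extraction idea only; TfidfVectorizer adds
    smoothing, normalization, and n-gram support on top of this.
    """
    tokenized = [doc.split() for doc in docs]
    df = Counter()                      # in how many docs each term appears
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = ["big data analysis", "big data tools", "cat pictures"]
vecs = tfidf_vectors(docs)
```

Terms shared by many documents (like "big") get a low weight, while terms unique to one document (like "cat") get a high weight; the one-class model then learns which weighted terms characterize the favorite articles.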
  18. Numerical Data Analysis with Pandas
      • Data structures: Series, DataFrame
      • Operations: slicing, indexing, subsetting

      ```python
      from pandas import Series
      import matplotlib.pyplot as plt

      sa = Series(index=range(len(commenters)), data=commenters)
      fig, axs = plt.subplots(1, 3)
      sa.plot(ax=axs[0], title="number of commenters", style="r.")
      ```
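Before the Series can be built, the per-author comment counts have to come from somewhere. A minimal sketch using only the standard library, with made-up comment records shaped like the MongoDB articles earlier (field names are the same, author values are hypothetical):

```python
from collections import Counter

# Hypothetical comment records, one dict per comment.
comments = [
    {"author": "alice"}, {"author": "bob"}, {"author": "alice"},
    {"author": "carol"}, {"author": "alice"}, {"author": "bob"},
]

# Comment count per author; this is the kind of array a pandas Series wraps.
commenters = Counter(c["author"] for c in comments)
counts = sorted(commenters.values(), reverse=True)
print(counts)  # [3, 2, 1]
```

Feeding `counts` into a `Series` and plotting it gives the "number of commenters" distribution shown on the slide.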
  19. Reference (1/2)
      • Books:
        – Wes McKinney, "Python for Data Analysis", O'Reilly, 2012
        – Toby Segaran, "Programming Collective Intelligence: Building Smart Web 2.0 Applications", O'Reilly, 2008
        – Philipp K. Janert, "Data Analysis with Open Source Tools", O'Reilly, 2010
      • Coursera courses:
        – Web Intelligence and Big Data
        – Introduction to Data Science
        – Data Analysis
  20. Reference (2/2)
      • Conferences: PyData 2013, PyData 2012
      • The author's former shares:
        – When Big Data Meet Python, COSCUP 2012
        – NLTK: natural language toolkit overview and application, 2012