Your SlideShare is downloading. ×
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Big data analysis in python @ PyCon.tw 2013

4,228

Published on

Big data analysis involves several processes: collecting, storage, computing, analysis and visualization. In this slides, the author demonstrates these processes by using python tools to build a data …

Big data analysis involves several processes: collecting, storage, computing, analysis and visualization. In this slides, the author demonstrates these processes by using python tools to build a data product. The example is based on text-analyzing an online forum.

Published in: Technology
0 Comments
29 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
4,228
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
162
Comments
0
Likes
29
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Jimmy Lair97922028 [at] ntu.edu.twhttp://tw.linkedin.com/pub/jimmy-lai/27/4a/5362013/05/26This slides: http://www.slideshare.net/jimmy_lai/big-data-analysis-in-python
  • 2. Outlines1. Overview2. Big Data Analysis Example3. Scrapy4. Mongodb5. Solr6. Scikit-learnBig Data Anysis in Python 2
  • 3. When Big Data meet Python3CollectingUser GeneratedContentMachineGenerated DataScrapy: scraping frameworkPyMongo: Python client for MongodbHadoop streaming: Linux pipe interfaceDisco: lightweight MapReduce in PythonStorageComputingAnalysisVisualizationPandas: data analysis/manipulationStatsmodels: statisticsNLTK: natural language processingScikit-learn: machine learningSolr: full text search by REST APIMatplotlib: plottingNetworkX: graph visualizationBig Data Anysis in PythonInfrastructurehttp://www.slideshare.net/jimmy_lai/when-big-data-meet-python
  • 4. Why Python?• Good code readabilityfor fast development.• Scripting language:the less code, themore productivity.• Application:– Web– GUI– OS– Science• Fast growing amongopen sourcecommunities.– Commits statisticsfrom ohloh.net4http://www.kdnuggets.com/2012/08/poll-analytics-data-mining-programming-languages.html
  • 5. Why Python?• The support of open source projectsBig Data Anysis in Python 5http://www.ohloh.net/languages/compare?measure=projects&percent=true&l0=java&l1=perl&l2=php&l3=python&l4=r&l5=ruby&l6=-1&l7=-1&commit=Update
  • 6. Big Data Analysis ExampleBig Data Anysis in Python 6
  • 7. Big Data Analysis ExampleBig Data Anysis in Python 7
  • 8. web scraping framework• pip install scrapy• Parse field by XPath• Spider– Request– Response– Parse_function– Pipeline• For Big Data:– Parallel crawlingBig Data Anysis in Python 8http://scrapy.org/Traverse web pages
  • 9. Big Data Anysis in Python 9Parse articleParse comentsweb scraping frameworkhttp://scrapy.org/
  • 10. class GroupSpider(BaseSpider):name = "group"allowed_domains = ["douban.com"]start_urls = ["http://www.douban.com/group/111410/discussion?start=%d" % ifor i in range(10000, 20000, 25)]def __init__(self):self.coll = get_coll()def parse(self, response):hxs = HtmlXPathSelector(response)for tr in hxs.select(//tr):title, url, author, author_url = None, None, None, Noneresult = tr.select(td[@class="title"])if len(result) == 1:url = result.select(a/@href).extract()[0]if self.coll.find_one({url: url}) is not None:self.log(duplicated url: %s % (url))else:yield Request(url, callback=self.parse_article)Big Data Anysis in Python 10Traverse web pagesExtract field by XPath
  • 11. Document based NoSQL database• Python client: pip installpymongo• Json style document• For Big Data:– Mongodb cluster withshard and replica• article:– author– author_url– title– content– comments– urlBig Data Anysis in Python 11http://www.mongodb.org/import pymongoconn = pymongo.Connection()coll = conn.database.collectioncoll.ensure_index(url, unique=True)article = {author: u小明, author_url:http://host/path1, comments: [],title: u天天向上, content: , url:http://host/path2}coll.insert(article)results = [i for i in coll.find({title: u天天向上})]
  • 12. Full-text search engineBig Data Anysis in Python 12http://lucene.apache.org/solr/• Customize full-textindexing andsearching by xmlconfig• RESTful API forselect/update index• Spatial Search• For Big Data:– SolrCloud:distributed indexs
  • 13. Big Data Anysis in Python 13<field name="url" type="text_general" indexed="true" stored="true"/><field name="title" type="text_general" indexed="true" stored="true"/><field name="content" type="text_general" indexed="true" stored="true"/><field name="comments" type="text_general" indexed="true" stored="true"multiValued="true"/><field name="author" type="text_general" indexed="true" stored="true"/><field name="author_url" type="text_general" indexed="true" stored="true"/><copyField source="title" dest="text"/><copyField source="author" dest="text"/><copyField source="content" dest="text"/><copyField source="comments" dest="text"/><fieldType name="text_smart_chinese" class="solr.TextField"positionIncrementGap="100"><tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/><filter class="solr.SmartChineseWordTokenFilterFactory"/><filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"enablePositionIncrements="true" /><filter class="solr.LowerCaseFilterFactory"/><filter class="solr.PorterStemFilterFactory"/></fieldType><uniqueKey>url</uniqueKey>Define fileds and typesschema.xmlDefine index process
  • 14. Big Data Anysis in Python 14
  • 15. Popular Book Ranking ListBy regex search in Mongodb By full-text search in Solr少有人走的路 那些年我们一起追的女孩百年孤独 了不起的盖茨比遇见未知的自己 我不要你死于一事无成平凡的世界 考拉小巫的英语学习日记送你一颗子弹 高效能人士的7个习惯小王子 哪来的天才拖延心理学 被嫌弃的松子的一生活着 与众不同的心理学苏菲的世界 奇特的一生红楼梦 普罗旺斯的一年目送 接纳不完美的自己如何阅读一本书 我就是想停下来,看看这个世界围城 这些都是你给我的爱Big Data Anysis in Python 15Search “《[^》]+》” pattern
  • 16. Machine Learning• Machine learnalgorithms– Supervised learning– Unsupervisedlearning• Processing– Model selection– PipelineBig Data Anysis in Python 16http://scikit-learn.org/• Application: ArticleRecommendation– Let computer figuresout your interests,and then recommendarticles for you.– Learning byOneClassSVM
  • 17. Big Data Anysis in Python 17articles = [coll.find_one({url: url})[content] for url in urls]tfidf = TfidfVectorizer(tokenizer=my_tokenizer,ngram_range=(1, 3))svm = OneClassSVM(kernel="linear", nu=0.3)train_vectors = tfidf.fit_transform(articles)svm.fit(train_vectors)svm.predict(test_vectors)• Result: recommend 390 from 8000 articles, and therecommended articles are good to me.Pickup somefavoritearticlesExtractfeatures fromarticle and letmodel learnfrom thefeaturesModel predicton unseenarticles andrecommendfor youArticle Recommendation
  • 18. Numerical Data Analysis• Data Structure:– Series– Frame• Operations: slicing,indexing, subsettingBig Data Anysis in Python 18from pandas import Seriesimport matplotlib.pyplot as pltsa = Series(index=range(len(commenters)),data=commenters)fig, axs = plt.subplots(1, 3)sa.plot(ax=axs[0], title=number ofcommenter, style=r.)http://pandas.pydata.org/
  • 19. Reference (1/2)• Book:– Wes McKinney, “Python for Data Analysis”,OReilly, 2012– Toby Segaran, “Programming CollectiveIntelligence: Building Smart Web 2.0Applications”, OReilly, 2008– Philipp K. Janert, “Data Analysis with OpenSource Tools”, OReilly, 2010• Coursera:– Web Intelligence and Big Data, url– Introduction to Data Science, url– Data Analysis, urlBig Data Anysis in Python 19
  • 20. Reference (2/2)• Conference:– PyData 2013, http://pydata.org/abstracts/– PyData 2012,http://marakana.com/s/post/1090/2012_pydata_workshop• My former shares:– When Big Data Meet Python, COSCUP2012, slides– NLTK: natural language toolkit overviewand application, PyCon.tw 2012, slidesBig Data Anysis in Python 20

×