SlideShare a Scribd company logo
1 of 20
Download to read offline
Jimmy Lai
r97922028 [at] ntu.edu.tw
http://tw.linkedin.com/pub/jimmy-lai/27/4a/536
2013/05/26
This slides: http://www.slideshare.net/jimmy_lai/big-data-analysis-in-python
Outlines
1. Overview
2. Big Data Analysis Example
3. Scrapy
4. Mongodb
5. Solr
6. Scikit-learn
Big Data Anysis in Python 2
When Big Data meet Python
3
Collecting
User Generated
Content
Machine
Generated Data
Scrapy: scraping framework
PyMongo: Python client for Mongodb
Hadoop streaming: Linux pipe interface
Disco: lightweight MapReduce in Python
Storage
Computing
Analysis
Visualization
Pandas: data analysis/manipulation
Statsmodels: statistics
NLTK: natural language processing
Scikit-learn: machine learning
Solr: full text search by REST API
Matplotlib: plotting
NetworkX: graph visualization
Big Data Anysis in Python
Infrastructure
http://www.slideshare.net/jimmy_lai/when-big-data-meet-python
Why Python?
• Good code readability
for fast development.
• Scripting language:
the less code, the
more productivity.
• Application:
– Web
– GUI
– OS
– Science
• Fast growing among
open source
communities.
– Commits statistics
from ohloh.net
4
http://www.kdnuggets.com/2012/08/poll-analytics-data-mining-
programming-languages.html
Why Python?
• The support of open source projects
Big Data Anysis in Python 5
http://www.ohloh.net/languages/compare?measure=projects&percent=true&l0=java&l1=
perl&l2=php&l3=python&l4=r&l5=ruby&l6=-1&l7=-1&commit=Update
Big Data Analysis Example
Big Data Anysis in Python 6
Big Data Analysis Example
Big Data Anysis in Python 7
web scraping framework
• pip install scrapy
• Parse field by XPath
• Spider
– Request
– Response
– Parse_function
– Pipeline
• For Big Data:
– Parallel crawling
Big Data Anysis in Python 8
http://scrapy.org/
Traverse web pages
Big Data Anysis in Python 9
Parse article
Parse coments
web scraping framework
http://scrapy.org/
class GroupSpider(BaseSpider):
name = "group"
allowed_domains = ["douban.com"]
start_urls = ["http://www.douban.com/group/111410/discussion?start=%d" % i
for i in range(10000, 20000, 25)]
def __init__(self):
self.coll = get_coll()
def parse(self, response):
hxs = HtmlXPathSelector(response)
for tr in hxs.select('//tr'):
title, url, author, author_url = None, None, None, None
result = tr.select('td[@class="title"]')
if len(result) == 1:
url = result.select('a/@href').extract()[0]
if self.coll.find_one({'url': url}) is not None:
self.log('duplicated url: %s' % (url))
else:
yield Request(url, callback=self.parse_article)
Big Data Anysis in Python 10
Traverse web pages
Extract field by XPath
Document based NoSQL database
• Python client: pip install
pymongo
• Json style document
• For Big Data:
– Mongodb cluster with
shard and replica
• article:
– author
– author_url
– title
– content
– comments
– url
Big Data Anysis in Python 11
http://www.mongodb.org/
import pymongo
conn = pymongo.Connection()
coll = conn.database.collection
coll.ensure_index('url', unique=True)
article = {'author': u'小明', 'author_url':
'http://host/path1', 'comments': [],
'title': u'天天向上', 'content': '', 'url':
'http://host/path2'}
coll.insert(article)
results = [i for i in coll.find({'title': u'天天
向上'})]
Full-text search engine
Big Data Anysis in Python 12
http://lucene.apache.org/solr/
• Customize full-text
indexing and
searching by xml
config
• RESTful API for
select/update index
• Spatial Search
• For Big Data:
– SolrCloud:
distributed indexs
Big Data Anysis in Python 13
<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="content" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"
multiValued="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="author_url" type="text_general" indexed="true" stored="true"/>
<copyField source="title" dest="text"/>
<copyField source="author" dest="text"/>
<copyField source="content" dest="text"/>
<copyField source="comments" dest="text"/>
<fieldType name="text_smart_chinese" class="solr.TextField"
positionIncrementGap="100">
<tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
<filter class="solr.SmartChineseWordTokenFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</fieldType>
<uniqueKey>url</uniqueKey>
Define fileds and types
schema.xml
Define index process
Big Data Anysis in Python 14
Popular Book Ranking List
By regex search in Mongodb By full-text search in Solr
少有人走的路 那些年我们一起追的女孩
百年孤独 了不起的盖茨比
遇见未知的自己 我不要你死于一事无成
平凡的世界 考拉小巫的英语学习日记
送你一颗子弹 高效能人士的7个习惯
小王子 哪来的天才
拖延心理学 被嫌弃的松子的一生
活着 与众不同的心理学
苏菲的世界 奇特的一生
红楼梦 普罗旺斯的一年
目送 接纳不完美的自己
如何阅读一本书 我就是想停下来,看看这个世界
围城 这些都是你给我的爱
Big Data Anysis in Python 15Search “《[^》]+》” pattern
Machine Learning
• Machine learn
algorithms
– Supervised learning
– Unsupervised
learning
• Processing
– Model selection
– Pipeline
Big Data Anysis in Python 16
http://scikit-learn.org/
• Application: Article
Recommendation
– Let computer figures
out your interests,
and then recommend
articles for you.
– Learning by
OneClassSVM
Big Data Anysis in Python 17
articles = [coll.find_one({'url': url})['content'] for url in urls]
tfidf = TfidfVectorizer(tokenizer=my_tokenizer,
ngram_range=(1, 3))
svm = OneClassSVM(kernel="linear", nu=0.3)
train_vectors = tfidf.fit_transform(articles)
svm.fit(train_vectors)
svm.predict(test_vectors)
• Result: recommend 390 from 8000 articles, and the
recommended articles are good to me.
Pickup some
favorite
articles
Extract
features from
article and let
model learn
from the
features
Model predict
on unseen
articles and
recommend
for you
Article Recommendation
Numerical Data Analysis
• Data Structure:
– Series
– Frame
• Operations: slicing,
indexing, subsetting
Big Data Anysis in Python 18
from pandas import Series
import matplotlib.pyplot as plt
sa = Series(index=range(len(commenters)),
data=commenters)
fig, axs = plt.subplots(1, 3)
sa.plot(ax=axs[0], title='number of
commenter', style='r.')
http://pandas.pydata.org/
Reference (1/2)
• Book:
– Wes McKinney, “Python for Data Analysis”,
O'Reilly, 2012
– Toby Segaran, “Programming Collective
Intelligence: Building Smart Web 2.0
Applications”, O'Reilly, 2008
– Philipp K. Janert, “Data Analysis with Open
Source Tools”, O'Reilly, 2010
• Coursera:
– Web Intelligence and Big Data, url
– Introduction to Data Science, url
– Data Analysis, url
Big Data Anysis in Python 19
Reference (2/2)
• Conference:
– PyData 2013, http://pydata.org/abstracts/
– PyData 2012,
http://marakana.com/s/post/1090/2012_pydata_workshop
• My former shares:
– When Big Data Meet Python, COSCUP
2012, slides
– NLTK: natural language toolkit overview
and application, PyCon.tw 2012, slides
Big Data Anysis in Python 20

More Related Content

What's hot

Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
EuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopEuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopMax Tepkeev
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
 
Data Science Stack with MongoDB and RStudio
Data Science Stack with MongoDB and RStudioData Science Stack with MongoDB and RStudio
Data Science Stack with MongoDB and RStudioWinston Chen
 
Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using pythonPurna Chander
 
Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19GoDataDriven
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedTuri, Inc.
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentationJoseph Adler
 
Graph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBGraph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBMohamed Taher Alrefaie
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and StatisticsWes McKinney
 

What's hot (20)

Python pandas Library
Python pandas LibraryPython pandas Library
Python pandas Library
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
EuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopEuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and Hadoop
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 
Data Science Stack with MongoDB and RStudio
Data Science Stack with MongoDB and RStudioData Science Stack with MongoDB and RStudio
Data Science Stack with MongoDB and RStudio
 
R meetup talk
R meetup talkR meetup talk
R meetup talk
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using python
 
Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19
 
Data Analysis With Pandas
Data Analysis With PandasData Analysis With Pandas
Data Analysis With Pandas
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Connecting HDF with ISO Metadata Standards
Connecting HDF with ISO Metadata StandardsConnecting HDF with ISO Metadata Standards
Connecting HDF with ISO Metadata Standards
 
Graph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBGraph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DB
 
Predicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systemsPredicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systems
 
Session 2
Session 2Session 2
Session 2
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
 

Viewers also liked

pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data AnalysisAndrew Henshaw
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to PythonNowell Strite
 
Python and Data Analysis
Python and Data AnalysisPython and Data Analysis
Python and Data AnalysisPraveen Nair
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With PythonSarah Guido
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
Learn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesLearn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesMatt Harrison
 
Behold the Power of Python
Behold the Power of PythonBehold the Power of Python
Behold the Power of PythonSarah Dutkiewicz
 
Data Analyst Nanodegree
Data Analyst NanodegreeData Analyst Nanodegree
Data Analyst NanodegreeJimmy Lai
 
Fast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookFast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookJimmy Lai
 
[LDSP] Solr Usage
[LDSP] Solr Usage[LDSP] Solr Usage
[LDSP] Solr UsageJimmy Lai
 
[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast PrototypingJimmy Lai
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012Jimmy Lai
 
Documentation with sphinx @ PyHug
Documentation with sphinx @ PyHugDocumentation with sphinx @ PyHug
Documentation with sphinx @ PyHugJimmy Lai
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in pythonJimmy Lai
 
Relational Database Access with Python
Relational Database Access with PythonRelational Database Access with Python
Relational Database Access with PythonMark Rees
 
Python for Data Analysis: Chapter 2
Python for Data Analysis: Chapter 2Python for Data Analysis: Chapter 2
Python for Data Analysis: Chapter 2智哉 今西
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesJimmy Lai
 
Build a Searchable Knowledge Base
Build a Searchable Knowledge BaseBuild a Searchable Knowledge Base
Build a Searchable Knowledge BaseJimmy Lai
 
Practical Data Analysis in Python
Practical Data Analysis in PythonPractical Data Analysis in Python
Practical Data Analysis in PythonHilary Mason
 

Viewers also liked (20)

pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data Analysis
 
Python Presentation
Python PresentationPython Presentation
Python Presentation
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
 
Python and Data Analysis
Python and Data AnalysisPython and Data Analysis
Python and Data Analysis
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With Python
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
Learn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesLearn 90% of Python in 90 Minutes
Learn 90% of Python in 90 Minutes
 
Behold the Power of Python
Behold the Power of PythonBehold the Power of Python
Behold the Power of Python
 
Data Analyst Nanodegree
Data Analyst NanodegreeData Analyst Nanodegree
Data Analyst Nanodegree
 
Fast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookFast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython Notebook
 
[LDSP] Solr Usage
[LDSP] Solr Usage[LDSP] Solr Usage
[LDSP] Solr Usage
 
[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012
 
Documentation with sphinx @ PyHug
Documentation with sphinx @ PyHugDocumentation with sphinx @ PyHug
Documentation with sphinx @ PyHug
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in python
 
Relational Database Access with Python
Relational Database Access with PythonRelational Database Access with Python
Relational Database Access with Python
 
Python for Data Analysis: Chapter 2
Python for Data Analysis: Chapter 2Python for Data Analysis: Chapter 2
Python for Data Analysis: Chapter 2
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languages
 
Build a Searchable Knowledge Base
Build a Searchable Knowledge BaseBuild a Searchable Knowledge Base
Build a Searchable Knowledge Base
 
Practical Data Analysis in Python
Practical Data Analysis in PythonPractical Data Analysis in Python
Practical Data Analysis in Python
 

Similar to Big data analysis in python @ PyCon.tw 2013

Custom web application development with Django for startups and Django-CRM intro
Custom web application development with Django for startups and Django-CRM introCustom web application development with Django for startups and Django-CRM intro
Custom web application development with Django for startups and Django-CRM introMicroPyramid .
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Jimmy Lai
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoSammy Fung
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science PlatformQAware GmbH
 
Workshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityWorkshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityRaffael Marty
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nlbartzon
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nltieleman
 
Python Load Testing - Pygotham 2012
Python Load Testing - Pygotham 2012Python Load Testing - Pygotham 2012
Python Load Testing - Pygotham 2012Dan Kuebrich
 
How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...Antonios Giannopoulos
 
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...Jimmy DeadcOde
 
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...it-people
 
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)Yury Leonychev
 

Similar to Big data analysis in python @ PyCon.tw 2013 (20)

Python ml
Python mlPython ml
Python ml
 
Custom web application development with Django for startups and Django-CRM intro
Custom web application development with Django for startups and Django-CRM introCustom web application development with Django for startups and Django-CRM intro
Custom web application development with Django for startups and Django-CRM intro
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
 
Web Scrapping Using Python
Web Scrapping Using PythonWeb Scrapping Using Python
Web Scrapping Using Python
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and Django
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science Platform
 
Analyse Yourself
Analyse YourselfAnalyse Yourself
Analyse Yourself
 
Workshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityWorkshop: Big Data Visualization for Security
Workshop: Big Data Visualization for Security
 
Mini Curso de Django
Mini Curso de DjangoMini Curso de Django
Mini Curso de Django
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Python Load Testing - Pygotham 2012
Python Load Testing - Pygotham 2012Python Load Testing - Pygotham 2012
Python Load Testing - Pygotham 2012
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 
How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...
 
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
 
Scrapy workshop
Scrapy workshopScrapy workshop
Scrapy workshop
 
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
 
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
 
Introduction to Django
Introduction to DjangoIntroduction to Django
Introduction to Django
 
Deploying Machine Learning Models to Production
Deploying Machine Learning Models to ProductionDeploying Machine Learning Models to Production
Deploying Machine Learning Models to Production
 

More from Jimmy Lai

Python Linters at Scale.pdf
Python Linters at Scale.pdfPython Linters at Scale.pdf
Python Linters at Scale.pdfJimmy Lai
 
The journey of asyncio adoption in instagram
The journey of asyncio adoption in instagramThe journey of asyncio adoption in instagram
The journey of asyncio adoption in instagramJimmy Lai
 
Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Jimmy Lai
 
Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...Jimmy Lai
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learnJimmy Lai
 
NetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugNetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugJimmy Lai
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHugJimmy Lai
 

More from Jimmy Lai (7)

Python Linters at Scale.pdf
Python Linters at Scale.pdfPython Linters at Scale.pdf
Python Linters at Scale.pdf
 
The journey of asyncio adoption in instagram
The journey of asyncio adoption in instagramThe journey of asyncio adoption in instagram
The journey of asyncio adoption in instagram
 
Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...
 
Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
 
NetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugNetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHug
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHug
 

Recently uploaded

EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarThousandEyes
 
UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2DianaGray10
 
Flow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameFlow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameKapil Thakar
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdfThe Good Food Institute
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applicationsnooralam814309
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.IPLOOK Networks
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kitJamie (Taka) Wang
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfInfopole1
 
Technical SEO for Improved Accessibility WTS FEST
Technical SEO for Improved Accessibility  WTS FESTTechnical SEO for Improved Accessibility  WTS FEST
Technical SEO for Improved Accessibility WTS FESTBillieHyde
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptxHansamali Gamage
 
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechWebinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechProduct School
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInThousandEyes
 
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxGraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxNeo4j
 
UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3DianaGray10
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc
 
How to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptxHow to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptxKaustubhBhavsar6
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024Brian Pichman
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud DataEric D. Schabell
 

Recently uploaded (20)

EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? Webinar
 
UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2
 
Flow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameFlow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First Frame
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applications
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kit
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdf
 
Technical SEO for Improved Accessibility WTS FEST
Technical SEO for Improved Accessibility  WTS FESTTechnical SEO for Improved Accessibility  WTS FEST
Technical SEO for Improved Accessibility WTS FEST
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx
 
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechWebinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
 
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxGraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
 
SheDev 2024
SheDev 2024SheDev 2024
SheDev 2024
 
UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
 
How to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptxHow to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptx
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data
 

Big data analysis in python @ PyCon.tw 2013

  • 1. Jimmy Lai r97922028 [at] ntu.edu.tw http://tw.linkedin.com/pub/jimmy-lai/27/4a/536 2013/05/26 This slides: http://www.slideshare.net/jimmy_lai/big-data-analysis-in-python
  • 2. Outlines 1. Overview 2. Big Data Analysis Example 3. Scrapy 4. Mongodb 5. Solr 6. Scikit-learn Big Data Anysis in Python 2
  • 3. When Big Data meet Python 3 Collecting User Generated Content Machine Generated Data Scrapy: scraping framework PyMongo: Python client for Mongodb Hadoop streaming: Linux pipe interface Disco: lightweight MapReduce in Python Storage Computing Analysis Visualization Pandas: data analysis/manipulation Statsmodels: statistics NLTK: natural language processing Scikit-learn: machine learning Solr: full text search by REST API Matplotlib: plotting NetworkX: graph visualization Big Data Anysis in Python Infrastructure http://www.slideshare.net/jimmy_lai/when-big-data-meet-python
  • 4. Why Python? • Good code readability for fast development. • Scripting language: the less code, the more productivity. • Application: – Web – GUI – OS – Science • Fast growing among open source communities. – Commits statistics from ohloh.net 4 http://www.kdnuggets.com/2012/08/poll-analytics-data-mining- programming-languages.html
  • 5. Why Python? • The support of open source projects Big Data Anysis in Python 5 http://www.ohloh.net/languages/compare?measure=projects&percent=true&l0=java&l1= perl&l2=php&l3=python&l4=r&l5=ruby&l6=-1&l7=-1&commit=Update
  • 6. Big Data Analysis Example Big Data Anysis in Python 6
  • 7. Big Data Analysis Example Big Data Anysis in Python 7
  • 8. web scraping framework • pip install scrapy • Parse field by XPath • Spider – Request – Response – Parse_function – Pipeline • For Big Data: – Parallel crawling Big Data Anysis in Python 8 http://scrapy.org/ Traverse web pages
  • 9. Big Data Anysis in Python 9 Parse article Parse coments web scraping framework http://scrapy.org/
  • 10. class GroupSpider(BaseSpider): name = "group" allowed_domains = ["douban.com"] start_urls = ["http://www.douban.com/group/111410/discussion?start=%d" % i for i in range(10000, 20000, 25)] def __init__(self): self.coll = get_coll() def parse(self, response): hxs = HtmlXPathSelector(response) for tr in hxs.select('//tr'): title, url, author, author_url = None, None, None, None result = tr.select('td[@class="title"]') if len(result) == 1: url = result.select('a/@href').extract()[0] if self.coll.find_one({'url': url}) is not None: self.log('duplicated url: %s' % (url)) else: yield Request(url, callback=self.parse_article) Big Data Anysis in Python 10 Traverse web pages Extract field by XPath
  • 11. Document based NoSQL database • Python client: pip install pymongo • Json style document • For Big Data: – Mongodb cluster with shard and replica • article: – author – author_url – title – content – comments – url Big Data Anysis in Python 11 http://www.mongodb.org/ import pymongo conn = pymongo.Connection() coll = conn.database.collection coll.ensure_index('url', unique=True) article = {'author': u'小明', 'author_url': 'http://host/path1', 'comments': [], 'title': u'天天向上', 'content': '', 'url': 'http://host/path2'} coll.insert(article) results = [i for i in coll.find({'title': u'天天 向上'})]
  • 12. Full-text search engine Big Data Anysis in Python 12 http://lucene.apache.org/solr/ • Customize full-text indexing and searching by xml config • RESTful API for select/update index • Spatial Search • For Big Data: – SolrCloud: distributed indexs
  • 13. Big Data Anysis in Python 13 <field name="url" type="text_general" indexed="true" stored="true"/> <field name="title" type="text_general" indexed="true" stored="true"/> <field name="content" type="text_general" indexed="true" stored="true"/> <field name="comments" type="text_general" indexed="true" stored="true" multiValued="true"/> <field name="author" type="text_general" indexed="true" stored="true"/> <field name="author_url" type="text_general" indexed="true" stored="true"/> <copyField source="title" dest="text"/> <copyField source="author" dest="text"/> <copyField source="content" dest="text"/> <copyField source="comments" dest="text"/> <fieldType name="text_smart_chinese" class="solr.TextField" positionIncrementGap="100"> <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/> <filter class="solr.SmartChineseWordTokenFilterFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.PorterStemFilterFactory"/> </fieldType> <uniqueKey>url</uniqueKey> Define fileds and types schema.xml Define index process
  • 14. Big Data Anysis in Python 14
  • 15. Popular Book Ranking List By regex search in Mongodb By full-text search in Solr 少有人走的路 那些年我们一起追的女孩 百年孤独 了不起的盖茨比 遇见未知的自己 我不要你死于一事无成 平凡的世界 考拉小巫的英语学习日记 送你一颗子弹 高效能人士的7个习惯 小王子 哪来的天才 拖延心理学 被嫌弃的松子的一生 活着 与众不同的心理学 苏菲的世界 奇特的一生 红楼梦 普罗旺斯的一年 目送 接纳不完美的自己 如何阅读一本书 我就是想停下来,看看这个世界 围城 这些都是你给我的爱 Big Data Anysis in Python 15Search “《[^》]+》” pattern
  • 16. Machine Learning • Machine learn algorithms – Supervised learning – Unsupervised learning • Processing – Model selection – Pipeline Big Data Anysis in Python 16 http://scikit-learn.org/ • Application: Article Recommendation – Let computer figures out your interests, and then recommend articles for you. – Learning by OneClassSVM
  • 17. Big Data Anysis in Python 17 articles = [coll.find_one({'url': url})['content'] for url in urls] tfidf = TfidfVectorizer(tokenizer=my_tokenizer, ngram_range=(1, 3)) svm = OneClassSVM(kernel="linear", nu=0.3) train_vectors = tfidf.fit_transform(articles) svm.fit(train_vectors) svm.predict(test_vectors) • Result: recommend 390 from 8000 articles, and the recommended articles are good to me. Pickup some favorite articles Extract features from article and let model learn from the features Model predict on unseen articles and recommend for you Article Recommendation
  • 18. Numerical Data Analysis • Data Structure: – Series – Frame • Operations: slicing, indexing, subsetting Big Data Anysis in Python 18 from pandas import Series import matplotlib.pyplot as plt sa = Series(index=range(len(commenters)), data=commenters) fig, axs = plt.subplots(1, 3) sa.plot(ax=axs[0], title='number of commenter', style='r.') http://pandas.pydata.org/
  • 19. Reference (1/2) • Book: – Wes McKinney, “Python for Data Analysis”, O'Reilly, 2012 – Toby Segaran, “Programming Collective Intelligence: Building Smart Web 2.0 Applications”, O'Reilly, 2008 – Philipp K. Janert, “Data Analysis with Open Source Tools”, O'Reilly, 2010 • Coursera: – Web Intelligence and Big Data, url – Introduction to Data Science, url – Data Analysis, url Big Data Anysis in Python 19
  • 20. Reference (2/2) • Conference: – PyData 2013, http://pydata.org/abstracts/ – PyData 2012, http://marakana.com/s/post/1090/2012_pydata_workshop • My former shares: – When Big Data Meet Python, COSCUP 2012, slides – NLTK: natural language toolkit overview and application, PyCon.tw 2012, slides Big Data Anysis in Python 20