When big data meet python @ COSCUP 2012

Big Data involves several issues: data collection, storage, computing, analysis and visualization. Python is a popular scripting language with good code readability, which makes it suitable for fast development. In these slides, the author shares how to address Big Data issues using Python open source tools.

Published in: Technology, Education
Transcript of "When big data meet python @ COSCUP 2012"

  1. When Big Data Meet Python
     Jimmy Lai (賴弘哲), jimmy.lai@oi-sys.com, 2012/08/19
     Slides: http://www.slideshare.net/jimmy_lai/when-big-data-meet-python
     "When big data meet python" by Jimmy Lai is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
  2. About me
     • 賴弘哲 (Jimmy Lai)
     • Interests: Data mining, Machine Learning, Natural Language Processing, Distributed Computing, Python
     • LinkedIn profile: http://goo.gl/XTEM5
     • Currently at Oxygen Intelligence Taiwan Limited (引京聚點知識結構搜索股份有限公司), working on semantic analysis of big data
  3. Outline
     1. Big Data
        a. Concept
        b. Technical issues
     2. Big Data + Python
        a. Related open source tools
        b. Example
  4. Benefits of Big Data
     1. Creating transparency, e.g. http://www.data.gov/
     2. Enabling experimentation to discover needs, expose variability, and improve performance
     3. Segmenting populations to customize actions
     4. Replacing/supporting human decision making with automated algorithms
     5. Innovating new business models, products and services
     Challenge: a shortage of talent with deep data-analysis skills
     Source: McKinsey Global Institute (May 2011). Big Data: The next frontier for innovation, competition, and productivity.
  5. Initiative from the White House
     • (Mar 2012) Big Data Research and Development Initiative, the White House.
     • The National Science Foundation encourages education on Big Data.
     • The government invests in developing state-of-the-art technologies, harnessing those technologies, and expanding the workforce for Big Data.
  6. Big Data Issues
     • Data sources: User Generated Content, Machine Generated Data
     • Issues: Collecting, Storage, Computing, Analysis, Visualization
  7. Big Data Techniques: Collecting
     • Crawler
       – Collect raw data
       – E.g. Heritrix, Nutch
     • Scraping
       – Parse information from raw data
       – E.g. Yahoo! Pipes, Scrapy
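The scraping step above (parsing information out of raw data) can be illustrated with a small sketch. It uses only Python's standard-library `html.parser` rather than Scrapy or any tool named on the slide, and the `LinkExtractor` class and the `RAW_HTML` sample page are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical raw page, standing in for what a crawler would have fetched.
RAW_HTML = """
<html><body>
  <h1>Food blog</h1>
  <a href="/post/1">Beef noodle review</a>
  <a href="/post/2">Night market <b>guide</b></a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(RAW_HTML)
links = parser.links
print(links)
```

A crawler would supply the raw pages; a scraper like this one turns each page into structured records that can be stored.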
  8. Big Data Techniques: Storage
     • Big Table
       – Distributed key-value storage
       – E.g. HBase, Cassandra
     • NoSQL
       – Does not use SQL for manipulation
       – Does not use the relational database model
       – E.g. MongoDB, Redis, CouchDB
  9. Big Data Techniques: Computing
     • Batch
       – MapReduce
       – E.g. Hadoop
     • Real-time
       – Stream processing
       – E.g. S4, Storm
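As a rough illustration of the batch MapReduce model named above, here is a single-process word count split into map, shuffle and reduce phases. This is only a sketch of the programming model, not how Hadoop is implemented; the function names and sample documents are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input documents.
docs = ["big data meet python", "python meet big data tools"]
counts = word_count(docs)
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data across the network; the user still writes only the two small functions.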
  10. Big Data Techniques: Analysis
      • Data mining
        – Weka
      • Machine learning
        – scikit-learn
      • Natural language processing
        – NLTK, Stanford NLP
      • Statistics
        – R
  11. Big Data Techniques: Visualization
      • Abstract
      • Interactive
      • E.g. Processing, Gephi, D3.js
  12. Why Python?
      • Good code readability for fast development.
      • Scripting language: less code, more productivity.
      • Fast growing among open source communities.
        – Commit statistics from ohloh.net
  13. When Big Data meet Python
      Python tools by stage (the slide also carries the label "Infrastructure"):
      • Collecting: Scrapy (scraping framework)
      • Storage: PyMongo (Python client for MongoDB)
      • Computing: Hadoop streaming (Linux pipe interface); Disco (lightweight MapReduce in Python)
      • Analysis: Pandas (data analysis/manipulation); Statsmodels (statistics); NLTK (natural language processing); Scikit-learn (machine learning)
      • Visualization: Matplotlib (plotting); NetworkX (graph visualization)
  14. When Big Data meet Python: Scrapy, http://scrapy.org/
      Web scraping framework
      • Simple and extensible
      • Components:
        – Scheduler
        – Downloader
        – Spider (Scraper)
        – Item pipeline
  15. When Big Data meet Python: MongoDB, http://www.mongodb.org/
      NoSQL database
      • PyMongo: client for Python
      • Document (JSON)-oriented
      • No schema
      • Scalable
      • Auto-sharding
      • Replica-set
      • File storage
      • MapReduce aggregation
  16. When Big Data meet Python: Disco, http://discoproject.org/
      • Distributed computing:
        – MapReduce
        – Disco distributed file system
      • Write code in Python
        – Easy and fast to profile
        – Easy and fast to debug
  17. When Big Data meet Python: pandas, http://pandas.pydata.org/
      • Data analysis library
      • Data structures for fast data manipulation
        – Slicing
        – Indexing
        – Subsetting
      • Handling missing data
      • Aggregation
      • Time series
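The pandas features listed above can be sketched on a tiny table. The data below (daily review counts for two restaurants) is invented for illustration, and the calls follow the current pandas API, which may differ in detail from the 2012 release the talk used.

```python
import numpy as np
import pandas as pd

# Hypothetical daily review counts for two restaurants, with one missing value.
dates = pd.date_range("2012-08-01", periods=6, freq="D")
df = pd.DataFrame(
    {
        "restaurant": ["A", "B", "A", "B", "A", "B"],
        "reviews": [10.0, 4.0, np.nan, 6.0, 14.0, 8.0],
    },
    index=dates,
)

# Slicing by label on the time-series index (both endpoints are included).
first_three_days = df.loc["2012-08-01":"2012-08-03"]

# Subsetting with a boolean mask.
only_a = df[df["restaurant"] == "A"]

# Handling missing data: treat the missing count as 0.
filled = df.fillna({"reviews": 0.0})

# Aggregation: total reviews per restaurant.
totals = filled.groupby("restaurant")["reviews"].sum()

# Time series: resample daily counts into 2-day buckets.
two_day = filled["reviews"].resample("2D").sum()

print(totals)
print(two_day)
```

Filling the gap with 0 is a choice made for this sketch; dropping the row or interpolating may suit other data better.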
  18. When Big Data meet Python: Statsmodels, http://statsmodels.sourceforge.net/
      • Statistical analysis
      • Statistical models
      • Fit data with model
      • Statistical tests
      • Data exploration
      • Time series analysis
  19. When Big Data meet Python: scikit-learn, http://scikit-learn.org/
      • Machine learning algorithms
      • Supervised learning
      • Unsupervised learning
      • Dataset
      • Preprocessing
      • Feature extraction
      • Model
      • Selection
      • Pipeline
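Several of the scikit-learn items above (dataset, preprocessing, supervised learning, pipeline) fit into one short sketch. It assumes the current scikit-learn API, whose module layout differs from the 2012 release, and the choice of the bundled iris dataset and of logistic regression is an assumption made purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Dataset: the small iris dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Pipeline: preprocessing (feature scaling) followed by a supervised model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Mean accuracy on the held-out samples.
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling parameters learned on the training split only, so the held-out score is not contaminated by test data.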
  20. When Big Data meet Python: NLTK (Natural Language Toolkit), http://nltk.org/
      • Natural language processing
      • Annotated corpora and resources
      • Information extraction workflow: Sentence Segmentation → Tokenization → POS tagging → Named Entity Recognition → Relation Recognition
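The first two stages of the information extraction workflow above can be sketched with the standard library alone. This is deliberately not NLTK: NLTK provides `nltk.sent_tokenize`, `nltk.word_tokenize` and `nltk.pos_tag`, but they rely on downloadable models, so the naive regular expressions and the helper names `split_sentences` and `tokenize` below are stand-ins invented for this illustration.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


# Hypothetical review text.
text = "The noodles were great. The service was slow! Would I return? Yes."
sentences = split_sentences(text)
tokens = [tokenize(sentence) for sentence in sentences]

for sentence_tokens in tokens:
    print(sentence_tokens)
```

Rules this simple break on abbreviations such as "Dr." and on languages without whitespace between words, which is one reason trained models are used for these stages in practice.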
  21. When Big Data meet Python: Matplotlib, http://matplotlib.sourceforge.net/
      • Plotting
        – Histograms
        – Power spectra
        – Bar charts
        – Error charts
        – Scatter plots
      • Full control over plotting details
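Two of the plot types listed above, a histogram and a scatter plot, are sketched here on randomly generated data. The review lengths and ratings are made up, and the figure is rendered to an in-memory PNG with the non-interactive Agg backend so the sketch runs without a display.

```python
import io
import random

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

random.seed(0)
# Hypothetical data: 500 review lengths and ratings.
lengths = [random.gauss(200, 40) for _ in range(500)]
ratings = [random.uniform(1, 5) for _ in lengths]

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: hist() returns the per-bin counts and the bin edges.
counts, bin_edges, _ = ax_hist.hist(lengths, bins=20)
ax_hist.set_xlabel("review length (characters)")
ax_hist.set_ylabel("number of reviews")
ax_hist.set_title("Histogram")

# Scatter plot: marker size is one of many details that can be controlled.
ax_scatter.scatter(lengths, ratings, s=8)
ax_scatter.set_xlabel("review length (characters)")
ax_scatter.set_ylabel("rating")
ax_scatter.set_title("Scatter plot")

fig.tight_layout()

# Render to an in-memory PNG instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Passing a filename to `fig.savefig` instead of the buffer writes the image to disk.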
  22. When Big Data meet Python: NetworkX, http://networkx.lanl.gov/
      • Graph algorithms and visualization
      • Draw graph with layout:
        – Circular
        – Random
        – Spectral
        – Spring
        – Shell
        – Graphviz
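A small sketch of the two NetworkX roles named above: running a graph algorithm and computing node positions with several of the listed layouts. The graph of blogs, restaurants and a dish is invented for this example, and the layout functions are assumed to have NumPy available, as they do in a standard NetworkX installation.

```python
import networkx as nx

# Hypothetical graph: blogs mention restaurants, restaurants serve a dish.
G = nx.Graph()
G.add_edges_from(
    [
        ("blog_a", "restaurant_x"),
        ("blog_b", "restaurant_x"),
        ("blog_b", "restaurant_y"),
        ("restaurant_x", "beef_noodles"),
        ("restaurant_y", "beef_noodles"),
    ]
)

# Graph algorithm: shortest path between a blog and a dish.
path = nx.shortest_path(G, "blog_a", "beef_noodles")

# Layouts: each returns a dict mapping node -> (x, y) position.
layouts = {
    "circular": nx.circular_layout(G),
    "spring": nx.spring_layout(G, seed=42),
    "shell": nx.shell_layout(G),
}

print(path)
for name, positions in layouts.items():
    print(name, len(positions))
```

Any of these position dictionaries can be passed to `nx.draw(G, pos=...)` to render the graph with Matplotlib.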
  23. 聚寶評 www.ezpao.com
      A food search engine that searches food-review posts across major blogs
  24. 聚寶評 www.ezpao.com
      Semantic analysis search engine
  25. • Review topic analysis
      • Analysis of dishes shared by users
      • Positive/negative review analysis
  26. Thank you for your attention. Q&A
      We are hiring!
      • Core engine algorithm R&D engineer
      • System R&D engineer
      • Web application R&D engineer
      Oxygen Intelligence Taiwan Limited (引京聚點知識結構搜索股份有限公司)
      • Company profile: http://www.ezpao.com/about/
      • Job openings: http://www.ezpao.com/join/
      • Please send your résumé to jimmy.lai@oi-sys.com
      "When big data meet python" by Jimmy Lai is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.