Ideal case: wget -O /tmp/mypage.html "$URL"
It is never the ideal case
● We need PUTs, not just GETs
● Sometimes we want to scrape a stream, with reconnections
● We have to send headers, session cookies...
● On the Deep Web you need a user and password!
self.conn = pycurl.Curl()
# Restart the connection if less than 1 byte/s is received during "timeout" seconds
if isinstance(self.timeout, int):
    self.conn.setopt(pycurl.LOW_SPEED_LIMIT, 1)
    self.conn.setopt(pycurl.LOW_SPEED_TIME, self.timeout)
self.conn.setopt(pycurl.URL, API_ENDPOINT_URL)
self.conn.setopt(pycurl.USERAGENT, USER_AGENT)
# Using gzip is optional but saves us bandwidth.
self.conn.setopt(pycurl.ENCODING, 'deflate, gzip')
self.conn.setopt(pycurl.POST, 1)
# urllib.urlencode is Python 2; on Python 3 use urllib.parse.urlencode
self.conn.setopt(pycurl.POSTFIELDS, urllib.urlencode(POST_PARAMS))
self.conn.setopt(pycurl.HTTPHEADER, ['Host: stream.twitter.com',
                                     'Authorization: %s' % self.get_oauth_header()])
# self.handle_tweet is the method that is called when new tweets arrive
self.conn.setopt(pycurl.WRITEFUNCTION, self.handle_tweet)
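Scraping a stream "with reconnections" needs a loop around the connection itself: the low-speed options only make curl give up on a stalled transfer, and something still has to open a new one. A minimal sketch of that loop with exponential backoff follows. The names `stream_with_reconnect`, `connect` and `handle` are hypothetical and not part of pycurl; `connect` stands for whatever opens the stream and yields messages.

```python
import time


def stream_with_reconnect(connect, handle, max_retries=5,
                          base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Consume a stream, reconnecting with exponential backoff on errors.

    connect: hypothetical callable that opens the stream and returns an
             iterable of messages; it raises OSError when the link drops.
    handle:  callable invoked once per received message.
    sleep:   injectable so the waiting can be observed or skipped in tests.
    """
    failures = 0
    while True:
        try:
            for message in connect():
                handle(message)
                failures = 0  # data arrived, so reset the backoff
            return  # the stream ended cleanly
        except OSError:
            failures += 1
            if failures > max_retries:
                raise  # give up after too many consecutive failures
            # Wait 1s, 2s, 4s, ... capped at max_delay
            sleep(min(base_delay * 2 ** (failures - 1), max_delay))
```

Resetting the counter whenever data arrives keeps a long-lived stream from running out of retries because of drops that happened hours apart.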
Requests: “HTTP for Humans” 
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass')) 
>>> r.status_code 
200 
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text 
u'{"type":"User"...' 
>>> r.json() 
{u'private_gists': 419, u'total_private_repos': 77, ...} 
http://docs.python-requests.org/en/latest/
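Requests handles the needs listed on the first slide through `requests.put(...)` and the `headers=`, `cookies=` and `auth=` arguments. As a sketch of what such a request carries, the block below builds (but does not send) a PUT with the standard library only, so it can be inspected offline. The URL, cookie name, payload and contact address are made-up placeholders.

```python
import base64
import json
import urllib.request


def build_put(url, payload, user, password, session_cookie, user_agent):
    """Build (without sending) a PUT request with auth, cookie and headers."""
    # HTTP Basic auth is "user:password" encoded in base64
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "Cookie": f"session={session_cookie}",  # hypothetical cookie name
            "Content-Type": "application/json",
            "User-Agent": user_agent,
        },
    )


# Placeholder values, for illustration only
req = build_put("https://example.org/api/item/1", {"name": "demo"},
                "user", "pass", "abc123", "my-scraper/0.1 (me@example.org)")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would actually send it
```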
AJAX, sessions, navigation!
● If curl or requests is not enough, you have to emulate a browser.
● WebDriver, in Selenium
http://www.seleniumhq.org/projects/webdriver/ 
WebDriver driver = new FirefoxDriver(); 
// And now use this to visit Google 
driver.get("http://www.google.com"); 
// Alternatively the same thing can be done like this 
// driver.navigate().to("http://www.google.com"); 
// Find the text input element by its name 
WebElement element = driver.findElement(By.name("q")); 
// Enter something to search for 
element.sendKeys("Cheese!"); 
// Now submit the form. WebDriver will find the form for us from the element 
element.submit();
Netiquette
(So you can't say nobody ever told you.)
● Check the /robots.txt of the sites you are going to scrape.
● Honestly, you should also check the X-Robots-Tag HTTP headers and the robots meta tags in the HTML.
● Control your speed. If the site is slow, ease off the pressure.
● And conversely, for more speed: use multiple IPs, use multiple scrapers, launch proxies in the cloud... httplib2 + squid?
● State in the User-Agent a way to contact you. Email, website.
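The rules above can be sketched with the standard library: `urllib.robotparser` answers whether a path may be fetched, and a small throttle doubles the wait when the site responds slowly. The `Throttle` class, its thresholds and the contact address in `USER_AGENT` are assumptions made for illustration, not an established recipe.

```python
import urllib.robotparser

# Hypothetical User-Agent that tells the site owner how to reach us
USER_AGENT = "example-scraper/0.1 (+mailto:me@example.org)"


def make_robots(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class Throttle:
    """Track the delay between requests; back off when the site is slow."""

    def __init__(self, min_delay=1.0, slow_threshold=2.0, max_delay=60.0):
        self.min_delay = min_delay
        self.slow_threshold = slow_threshold
        self.max_delay = max_delay
        self.delay = min_delay

    def update(self, response_seconds):
        """Return the delay to wait before the next request."""
        if response_seconds > self.slow_threshold:
            # Slow answer: double the wait, up to max_delay
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Fast answer: relax gradually, never below min_delay
            self.delay = max(self.delay / 2, self.min_delay)
        return self.delay


robots = make_robots("User-agent: *\nDisallow: /private/\nCrawl-delay: 5\n")
print(robots.can_fetch(USER_AGENT, "https://example.org/private/x"))
```

In a real crawler the loop would call `time.sleep(throttle.update(elapsed))` after each response and skip every URL for which `can_fetch` is false.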
Parsing 
● HTML/XML: SAX, XPath, …
● JSON: .loads(), etc. …
● JS on the server: Node.js
● BeautifulSoup 
import xml.etree.ElementTree as ET 
xsltproc :-( 
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="response/results/…">
      <xsl:value-of select="field[@id='content']" />
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
http://www.crummy.com/software/BeautifulSoup/ 
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
for link in soup.find_all('a'):
    print(link.get('href'))
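For comparison, the same kind of extraction can be sketched with `xml.etree.ElementTree` and `json.loads` instead of xsltproc. The sample documents below are invented, and the `item` element name is an assumption, since the slide does not show the real structure under `response/results`.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample; only response/results and field[@id='content'] come from the slide
XML_DOC = """<response><results>
  <item><field id="content">first text</field><field id="title">t1</field></item>
  <item><field id="content">second text</field></item>
</results></response>"""


def extract_contents(xml_text):
    """Return the text of every field whose id attribute is 'content'."""
    root = ET.fromstring(xml_text)  # root is the <response> element
    path = "./results/item/field[@id='content']"
    return [field.text for field in root.findall(path)]


def screen_name(tweet_json):
    """Pull a nested value out of a JSON document."""
    return json.loads(tweet_json)["user"]["screen_name"]


print(extract_contents(XML_DOC))
print(screen_name('{"user": {"screen_name": "alice"}, "text": "hola"}'))
```

ElementTree only supports a limited subset of XPath, which is enough for attribute predicates like the one used here.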
Storing and Analyzing
● PostgreSQL: has JSON and GIS extensions
● MySQL: …
select position, json->'user'->>'screen_name', json->>'text'
from georaw
where cod_prov='28'
  and st_Distance_Sphere(position::geometry, st_makepoint(-3.73679,40.44439)) < 50;
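The distance filter in that query keeps rows within 50 metres of a longitude/latitude point. A rough Python sketch of the same idea uses the haversine formula; the Earth radius constant and the helper names are assumptions, and PostGIS may compute the distance slightly differently.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (assumed value)


def sphere_distance_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within(points, center, radius_m):
    """Keep the (lon, lat) points closer than radius_m to center."""
    return [p for p in points
            if sphere_distance_m(p[0], p[1], center[0], center[1]) < radius_m]


CENTER = (-3.73679, 40.44439)  # same lon/lat order as st_makepoint(x, y)
candidates = [(-3.73679, 40.44469), (-3.73679, 40.44539)]
print(within(candidates, CENTER, 50))
```

Note the argument order: like `st_makepoint`, the sketch takes longitude first and latitude second.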
● HDFS/Hive/etc.: if you have more than one machine.
– (or one with many cores)
– (or you could have them and want to use Spark, MapReduce, etc.)
./bin/spark-shell --total-executor-cores 7
sc.textFile("hdfs://localhost:9000/user/hadoopsingle/geoRaw").filter(line => line.contains("trafico")).count
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
import org.apache.spark.sql.catalyst.expressions._
val TableHQL = hiveContext.hql("FROM raw.csv SELECT id, type, length").groupBy(..........).persist()
TableHQL.map { case Row(id, t, l) => (l.asInstanceOf[Double] * 0.30) }.reduce(_ + _)
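On a single machine with little data, the same two computations (count the lines containing a word, then sum a scaled column) fit in a few lines of plain Python. This is only a sketch to show what the Spark calls do; the function names and the sample rows are made up.

```python
from functools import reduce


def count_matching(lines, word):
    """Equivalent of filter(line => line.contains(word)).count"""
    return sum(1 for line in lines if word in line)


def weighted_total(rows, factor=0.30):
    """Equivalent of map { case Row(id, t, l) => l * factor }.reduce(_ + _)"""
    scaled = (length * factor for _id, _type, length in rows)
    return reduce(lambda acc, value: acc + value, scaled, 0.0)


print(count_matching(["hay trafico", "nada", "trafico denso"], "trafico"))
print(weighted_total([(1, "a", 10.0), (2, "b", 20.0)]))
```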
● API (To be continued...)

Open Social Data (Jaca), Alejandro Rivero
