Presentation by Daniel Burseth at the MIT Big Data Explorers "Crash Course" on 9/20/2014: "An end-to-end demonstration of generating, cleaning, and visualizing a 'messy' data set."
http://www.mitbigdataexplorers.com/
6. AWS -> EC2
Launch instance: ami-c6b61fae (US-EAST)
Instance type m3.medium
Connect
You should see some software on the desktop
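For anyone who would rather script this setup than click through the console, a minimal boto3 sketch of launching the same instance; the AMI ID and instance type come from the slide above, while the key pair and security group names are placeholders:

import boto3

# Launch one m3.medium instance from the workshop AMI in us-east-1.
# KeyName and SecurityGroups are placeholders, not values from the deck.
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-c6b61fae",
    InstanceType="m3.medium",
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",
    SecurityGroups=["workshop-sg"],
)
print(response["Instances"][0]["InstanceId"])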
7. Scrape all of Craigslist’s Boston apartment listings using WebHarvy
Examine, clean, and prepare the data set using OpenRefine
Map our data and apply filters using Tableau
……all without writing a single line of code.
8.
9. A hyper-intelligent utility to scrape website data.
From SysNucleus, makers of USBTrace.
Heavy-duty alternatives: Scrapy (scrapy.org), Beautiful Soup
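For comparison with the point-and-click approach, a minimal sketch of the same kind of scrape using requests and Beautiful Soup (one of the alternatives named above). The listings URL and the CSS selectors are illustrative assumptions, not Craigslist's actual markup, and the terms-of-service caveat later in the deck applies here too:

import requests
from bs4 import BeautifulSoup

# Fetch one page of listings and pull out title, URL, and price.
# The URL path and class names below are assumptions for illustration only.
url = "https://boston.craigslist.org/search/apa"
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select("li.result-row"):
    title = item.select_one("a.result-title")
    price = item.select_one("span.result-price")
    rows.append({
        "Title": title.get_text(strip=True) if title else "",
        "URL": title["href"] if title else "",
        "Price": price.get_text(strip=True) if price else "",
    })
print(len(rows), "listings scraped")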
10. HTTP://SHOUTKEY.COM/WIRE
1. Start Config
2. Click on Hungry Mother – capture text
3. Click on Hungry Mother – capture URL
4. Click on Kendall Square/MIT – capture text
5. Click last review – capture text
CLEAR
1. Mine -> Scrape a list of similar links
2. Click on Hungry Mother
11. Let’s start collecting information in the first sub-page.
12. Edit Clear
Navigate into a sub-page
Start Config
Set as Next Page Link
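The "Set as Next Page Link" step above is just pagination: follow the next-page link until there is none. A rough code sketch, where the starting URL and the "a.next" selector are placeholders:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Follow "next page" links until they run out, keeping each page's HTML.
# The start URL and the a.next selector are placeholders for illustration.
url = "https://example.com/listings?page=1"
pages = []
while url:
    html = requests.get(url, timeout=30).text
    pages.append(html)
    soup = BeautifulSoup(html, "html.parser")
    next_link = soup.select_one("a.next")
    url = urljoin(url, next_link["href"]) if next_link else None
print(len(pages), "pages collected")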
13. Scheduler
Input keywords
Pause Inject (word of caution: scraping often violates TOS. Potentially not viable for apps, commercial purposes!)
TRY VISITING CRAIGSLIST IN AWS BTW!!
Proxy
Database export
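Pausing between requests and routing through a proxy, two of the options listed above, look roughly like this with requests; the proxy address and page URLs are placeholders, and the same word of caution about terms of service applies:

import time
import requests

# Wait a few seconds between requests and (optionally) go through a proxy.
# The proxy address and page URLs are placeholders, not recommendations.
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}
for url in ["https://example.com/page1", "https://example.com/page2"]:
    response = requests.get(url, proxies=proxies, timeout=30)
    print(url, response.status_code)
    time.sleep(5)  # the "pause" between requests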
14. Download Craigslist Boston from http://shoutkey.com/glorify
Look at our data: open Boston Dirty.csv (20k rows of mess!)
Time to CLEAN: Launch GOOGLE-REFINE.EXE
Within MOZILLA, navigate to http://127.0.0.1:3333/
Create Project -> This Computer -> Browse
Parse by tab
Create Project
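If you later want the same file in code rather than in Refine, loading it is a one-liner in pandas. A sketch that assumes the export is tab-delimited, matching the "Parse by tab" step, and that the file name matches the slide:

import pandas as pd

# "Parse by tab" in Refine corresponds to sep="\t" here.
df = pd.read_csv("Boston Dirty.csv", sep="\t")
print(df.shape)   # expect roughly 20k rows of messy listings
print(df.head())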
15. 1. First, sort your column.
2. Then, invoke "Re-order rows permanently" in the "Sort" dropdown menu that appears on top of the middle of the data table.
3. Then invoke Edit cells and Blank down on the Title column.
4. Then on that column, invoke menu Facet > Custom facets and Facet by blank.
5. Select true in that facet, and invoke Remove matching rows in the leftmost "all" dropdown menu.
6. Remove the facet.
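The six Refine steps above boil down to "sort, then keep the first row for each Title." A pandas sketch of the same idea, assuming the column really is named Title and the file is tab-delimited as before:

import pandas as pd

df = pd.read_csv("Boston Dirty.csv", sep="\t")

# Steps 1-2 (sort and re-order permanently), then steps 3-6
# (blank down, facet by blank, remove matching rows) collapse into drop_duplicates.
df = df.sort_values("Title").drop_duplicates(subset="Title", keep="first")
print(len(df), "rows after de-duplication")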
29. Great “semantic” example. Tableau understands that this text translates to a lat/long.
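Tableau is geocoding that text behind the scenes. If you ever need the same lookup outside Tableau, geopy's Nominatim geocoder is one option; the user_agent string and the example place name are arbitrary choices here:

from geopy.geocoders import Nominatim

# Turn a place name into a latitude/longitude pair, as Tableau does implicitly.
geolocator = Nominatim(user_agent="craigslist-demo")  # arbitrary identifier
location = geolocator.geocode("Kendall Square, Cambridge, MA")
if location:
    print(location.latitude, location.longitude)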
30. Look on the map in the lower right corner
Let’s “Filter Data”
31. Under “Measures”, drag “Price” onto size in “Marks”
Change sum(Price) to avg(Price)
Drag Price into Filters, change it to max(Price), and select an “At Most” condition
Right click on the filter and show “Quick Filter”
Drag “City” onto “Label”
Menu Map -> Map Options
Click on a node for info and drill down potential
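The avg(Price) and max(Price) manipulations above are easy to sanity-check in pandas. A sketch that assumes Price and City columns and an arbitrary $3,000 "At Most" cutoff:

import pandas as pd

df = pd.read_csv("Boston Dirty.csv", sep="\t")
# Strip "$" and "," before converting prices to numbers (assumed raw format).
df["Price"] = pd.to_numeric(df["Price"].astype(str).str.replace(r"[$,]", "", regex=True), errors="coerce")

# The "At Most" filter; the $3,000 cutoff is an arbitrary example value.
df = df[df["Price"] <= 3000]

# Average and maximum price per city, the analogue of avg(Price)/max(Price) by City.
print(df.groupby("City")["Price"].agg(["mean", "max"]).round(0))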
32.
33. 1. Explored various webpage structures and scraped them
2. Exported the data to Refine
3. Parsed columns to extract critical price and location information
4. Used clustering algorithms to merge related geographies
5. Applied filters to identify errant prices
6. Exported the data to Tableau
7. Completed a really cursory mapping visualization
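Step 4's "clustering" is Refine's key-collision clustering: normalize each geography string to a key and merge rows whose keys collide. A rough sketch of that fingerprint-style keying, plus a simple errant-price filter for step 5; the column names and price bounds are assumptions:

import re
import pandas as pd

def fingerprint(value: str) -> str:
    # Roughly what Refine's fingerprint key does: lowercase, strip punctuation,
    # split into tokens, de-duplicate, sort, and rejoin.
    tokens = re.sub(r"[^\w\s]", " ", str(value).lower()).split()
    return " ".join(sorted(set(tokens)))

df = pd.read_csv("Boston Dirty.csv", sep="\t")
df["CityKey"] = df["City"].map(fingerprint)  # "Cambridge," and "cambridge " now collide

df["Price"] = pd.to_numeric(df["Price"].astype(str).str.replace(r"[$,]", "", regex=True), errors="coerce")
df = df[df["Price"].between(100, 10000)]     # crude errant-price filter (assumed bounds)
print(df.groupby("CityKey")["Price"].mean().round(0))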