SlideShare a Scribd company logo
1 of 72
Introduction to Data Science
Lecture 3
Manipulating Tabular Data
Intro. to Data Science Fall 2015
John Canny
including notes from Michael Franklin and
others
Outline for this Evening
• Two views of tables:
– SQL/Pandas
– OLAP/Numpy/Matlab
• SQL, NoSQL
– Non-Tabular Structures
Data Science – One Definition
The Big Picture
4
Extract
Transform
Load
Two views of tables
First view
Key Concept: Structured Data
A data model is a collection of concepts for
describing data.
A schema is a description of a particular
collection of data, using a given data model.
The Relational Model*
• The Relational Model is Ubiquitous:
• MySQL, PostgreSQL, Oracle, DB2, SQLServer, …
• Foundational work done at
• IBM - System R
• UC Berkeley - Ingres
• Object-oriented concepts have been merged in
• Early work: POSTGRES research project at Berkeley
• Informix, IBM DB2, Oracle 8i
• Also has support for XML (semi-structured data)
*Codd, E. F. (1970). "A relational model of data for large shared data banks".
Communications of the ACM 13 (6): 37
E. F., “Ted” Codd
Turing Award 1981
Relational Database: Definitions
• Relational database: a set of relations
• Relation: made up of 2 parts:
Schema : specifies name of relation, plus name and
type of each column
Students(sid: string, name: string, login: string,
age: integer, gpa: real)
Instance : the actual data at a given time
• #rows = cardinality
• #fields = degree / arity
• A relation is a mathematical object (from set theory)
which is true for certain arguments.
• An instance defines the set of arguments for which the
relation is true (it’s a table not a row).
Ex: Instance of Students Relation
sid name login age gpa
53666 Jones jones@cs 18 3.4
53688 Smith smith@eecs 18 3.2
53650 Smith smith @math 19 3.8
• Cardinality = 3, arity = 5 , all rows distinct
• The relation is true for these tuples and false for others
SQL - A language for Relational DBs*
• SQL = Structured Query Language
• Data Definition Language (DDL)
– create, modify, delete relations
– specify constraints
– administer users, security, etc.
• Data Manipulation Language (DML)
– Specify queries to find tuples that satisfy criteria
– add, modify, remove tuples
• The DBMS is responsible for efficient evaluation.
* Developed at IBM by Donald D. Chamberlin and Raymond F. Boyce in the 1970s.
Used to be SEQUEL (Structured English QUEry Language)
Creating Relations in SQL
• Create the Students relation.
– Note: the type (domain) of each field is specified,
and enforced by the DBMS whenever tuples are
added or modified.
CREATE TABLE Students
(sid CHAR(20),
name CHAR(20),
login CHAR(10),
age INTEGER,
gpa FLOAT)
Table Creation (continued)
• Another example: the Enrolled table holds
information about courses students take.
CREATE TABLE Enrolled
(sid CHAR(20),
cid CHAR(20),
grade CHAR(2))
Adding and Deleting Tuples
• Can insert a single tuple using:
INSERT INTO Students (sid, name, login, age, gpa)
VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2)
• Can delete all tuples satisfying some condition (e.g.,
name = Smith):
DELETE
FROM Students S
WHERE S.name = 'Smith'
Queries in SQL
• Single-table queries are straightforward.
• To find all 18 year old students, we can write:
SELECT *
FROM Students S
WHERE S.age=18
• To find just names and logins, replace the first line:
SELECT S.name, S.login
Joins and Inference
• Chaining relations together is the basic inference
method in relational DBs. It produces new
relations (effectively new facts) from the data:
SELECT S.name, M.mortality
FROM Students S, Mortality M
WHERE S.Race=M.Race
Name Race
Socrates Man
Thor God
Barney Dinosaur
Blarney stone Stone
Race Mortality
Man Mortal
God Immortal
Dinosaur Mortal
Stone Non-living
M
S
Joins and Inference
• Chaining relations together is the basic inference
method in relational DBs. It produces new
relations (effectively new facts) from the data:
SELECT S.name, M.mortality
FROM Students S, Mortality M
WHERE S.Race=M.Race
Name Mortality
Socrates Mortal
Thor Immortal
Barney Mortal
Blarney stone Non-living
Basic SQL Query
• relation-list : A list of relation names
• possibly with a range-variable after each name
• target-list : A list of attributes of tables in relation-list
• qualification : Comparisons combined using AND, OR and
NOT.
• Comparisons are Attr op const or Attr1 op Attr2, where
op is one of =≠<>≤≥
• DISTINCT: optional keyword indicating that the
answer should not contain duplicates.
• In SQL SELECT, the default is that duplicates are not
eliminated! (Result is called a “multiset”)
SELECT [DISTINCT] target-list
FROM relation-list
WHERE qualification
SQL Inner Joins
SELECT S.name, E.classid
FROM Students S (INNER) JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
44444 English10
S.name E.classid
Jones History105
Jones DataScience194
Smith French150
E
S
Note the previous version of this query (with no join keyword) is an “Implicit join”
SQL Inner Joins
SELECT S.name, E.classid
FROM Students S (INNER) JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
44444 English10
S.name E.classid
Jones History105
Jones DataScience194
Smith French150
E
S
Unmatched keys
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
44444 English10
S.name E.classid
Jones History105
Jones DataScience194
Smith French150
Brown NULL
E
S
SQL Joins
SELECT S.name, E.classid
FROM Students S LEFT OUTER JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
44444 English10
S.name E.classid
Jones History105
Jones DataScience194
Smith French150
Brown NULL
E
S
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
44444 English10
S.name E.classid
Jones History105
Jones DataScience194
Smith French150
NULL English10
E
S
SQL Joins
SELECT S.name, E.classid
FROM Students S RIGHT OUTER JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
44444 English10
S.name E.classid
Jones History105
Jones DataScience194
Smith French150
NULL English10
E
S
SQL Joins
SELECT S.name, E.classid
FROM Students S ? JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
44444 English10
S.name E.classid
Jones History105
Jones DataScience194
Smith French150
NULL English10
Brown NULL
E
S
SQL Joins
SELECT S.name, E.classid
FROM Students S FULL OUTER JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
44444 English10
S.name E.classid
Jones History105
Jones DataScience194
Smith French150
NULL English10
Brown NULL
E
S
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
44444 English10
S.name E.classid
Jones History105
Smith French150
E
S
SQL Joins
SELECT S.name, E.classid
FROM Students S LEFT SEMI JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
44444 English10
S.name E.classid
Jones History105
Smith French150
E
S
What kind of Join is this?
SELECT *
FROM Students S ?? Enrolled E
S.name S.sid
Jones 11111
Smith 22222
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
S.name S.sid E.sid E.classid
Jones 11111 11111 History105
Jones 11111 11111 DataScience194
Jones 11111 22222 French150
Smith 22222 11111 History105
Smith 22222 11111 DataScience194
Smith 22222 22222 French150
E
S
SQL Joins
SELECT *
FROM Students S CROSS JOIN Enrolled E
S.name S.sid
Jones 11111
Smith 22222
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
S.name S.sid E.sid E.classid
Jones 11111 11111 History105
Jones 11111 11111 DataScience194
Jones 11111 22222 French150
Smith 22222 11111 History105
Smith 22222 11111 DataScience194
Smith 22222 22222 French150
E
S
What kind of Join is this?
SELECT *
FROM Students S, Enrolled E
WHERE S.sid <= E.sid
S.name S.sid
Jones 11111
Smith 22222
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
S.name S.sid E.sid E.classid
Jones 11111 11111 History105
Jones 11111 11111 DataScience194
Jones 11111 22222 French150
Smith 22222 22222 French150
E
S
Theta Joins
SELECT *
FROM Students S, Enrolled E
WHERE S.sid <= E.sid
S.name S.sid
Jones 11111
Smith 22222
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
S.name S.sid E.sid E.classid
Jones 11111 11111 History105
Jones 11111 11111 DataScience194
Jones 11111 22222 French150
Smith 22222 22222 French150
E
S
Recall: Tweet JSON Format
{"id"=>12296272736,
"text"=>
"An early look at Annotations:
http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453",
"created_at"=>"Fri Apr 16 17:55:46 +0000 2010",
"in_reply_to_user_id"=>nil,
"in_reply_to_screen_name"=>nil,
"in_reply_to_status_id"=>nil
"favorited"=>false,
"truncated"=>false,
"user"=>
{"id"=>6253282,
"screen_name"=>"twitterapi",
"name"=>"Twitter API",
"description"=>
"The Real Twitter API. I tweet about API changes, service issues and
happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",
"url"=>"http://apiwiki.twitter.com",
"location"=>"San Francisco, CA",
"profile_background_color"=>"c1dfee",
"profile_background_image_url"=>
"http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",
"profile_background_tile"=>false,
"profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",
"profile_link_color"=>"0000ff",
"profile_sidebar_border_color"=>"87bc44",
"profile_sidebar_fill_color"=>"e0ff92",
"profile_text_color"=>"000000",
"created_at"=>"Wed May 23 06:01:13 +0000 2007",
"contributors_enabled"=>true,
"favourites_count"=>1,
"statuses_count"=>1628,
"friends_count"=>13,
"time_zone"=>"Pacific Time (US & Canada)",
"utc_offset"=>-28800,
"lang"=>"en",
"protected"=>false,
"followers_count"=>100581,
"geo_enabled"=>true,
"notifications"=>false,
"following"=>true,
"verified"=>true},
"contributors"=>[3191321],
"geo"=>nil,
"coordinates"=>nil,
"place"=>
{"id"=>"2b6ff8c22edd9576",
"url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json",
"name"=>"SoMa",
"full_name"=>"SoMa, San Francisco",
"place_type"=>"neighborhood",
"country_code"=>"US",
"country"=>"The United States of America",
"bounding_box"=>
{"coordinates"=>
[[[-122.42284884, 37.76893497],
[-122.3964, 37.76893497],
[-122.3964, 37.78752897],
[-122.42284884, 37.78752897]]],
"type"=>"Polygon"}},
"source"=>"web"}
The tweet's unique ID. These
IDs are roughly sorted &
developers should treat them
as opaque (http://bit.ly/dCkppc).
Text of the tweet.
Consecutive duplicate tweets
are rejected. 140 character
max (http://bit.ly/4ud3he).
Tweet's
creation
date.
DEPRECATED
The ID of an existing tweet that
this tweet is in reply to. Won't
be set unless the author of the
referenced tweet is mentioned.
The screen name &
user ID of replied to
tweet author.
Truncated to 140
characters. Only
possible from SMS.
The
author
of
the
tweet.
This
embedded
object
can
get
out
of
sync.
The
author's
user
ID.
The author's
user name.
The author's
screen name.
The author's
biography.
The author's
URL.
The author's "location". This is a free-form text field, and
there are no guarantees on whether it can be geocoded.
Rendering information
for the author. Colors
are encoded in hex
values (RGB).
The creation date
for this account.
Whether this account has
contributors enabled
(http://bit.ly/50npuu). Number of
favorites this
user has.
Number
of
tweets
this
user
has.
Number of
users this user
is following.
The timezone and offset
(in seconds) for this user.
The user's selected
language.
Whether this user is protected
or not. If the user is protected,
then this tweet is not visible
except to "friends".
Number of
followers for
this user.
Whether
this
user
has
geo
enabled
(http://bit.ly/4pFY77).
DEPRECATED
in this context
Whether this user
has a verified badge.
The
geo
tag
on
this
tweet
in
GeoJSON
(http://bit.ly/b8L1Cp).
The contributors' (if any) user
IDs (http://bit.ly/50npuu).
DEPRECATED
The place associated with this
Tweet (http://bit.ly/b8L1Cp).
The place ID
The URL to fetch a detailed
polygon for this place
The printable names of this place
The type of this
place - can be a
"neighborhood"
or "city"
The country this place is in
The bounding
box for this
place
The application
that sent this
tweet
Map of a Twitter Status Object
Raffi Krikorian <raffi@twitter.com>
18 April 2010
{"id"=>12296272736,
"text"=>
"An early look at Annotations:
http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453",
"created_at"=>"Fri Apr 16 17:55:46 +0000 2010",
"in_reply_to_user_id"=>nil,
"in_reply_to_screen_name"=>nil,
"in_reply_to_status_id"=>nil
"favorited"=>false,
"truncated"=>false,
"user"=>
{"id"=>6253282,
"screen_name"=>"twitterapi",
"name"=>"Twitter API",
"description"=>
"The Real Twitter API. I tweet about API changes, service issues and
happily answer questions about Twitter and our API. Don't get an answer? It's on my website.
The tweet's unique ID. These
IDs are roughly sorted &
developers should treat them
as opaque (http://bit.ly/dCkppc).
Text of the tweet.
Consecutive duplicate tweets
are rejected. 140 character
max (http://bit.ly/4ud3he).
DEPRECATED
The ID of an existing tweet that
this tweet is in reply to. Won't
be set unless the author of the
referenced tweet is mentioned.
The screen name &
user ID of replied to
tweet author.
Truncated to 140
characters. Only
possible from SMS.
.
This
ut
of
sync.
The
author's
user
ID.
The author's
user name.
The author's
screen name.
Normalization
Raw twitter data storage is very inefficient because, e.g. user
records are repeated with every tweet by that user.
Normalization is the process of minimizing data redundancy.
User.id Name Attribs…
111 Jones
222 Smith
Loc.id Name Attribs…
1111 Berkeley
2222 Oakland
3333 Hayward
Tweet id User id Location id Body
11 111 1111 I need a Jamba juice
22 111 1111 Cal Soccer rules
33 111 2222 Why do we procrastinate?
44 222 3333 Close your eyes and push “go”
Normalization
Normalized tables include only a foreign key to the information in
another table for repeated data.
The original table is the result of inner joins between tables.
User.id Name Attribs…
111 Jones
222 Smith
Loc.id Name Attribs…
1111 Berkeley
2222 Oakland
3333 Hayward
Tweet id User id Location id Body
11 111 1111 I need a Jamba juice
22 111 1111 Cal Soccer rules
33 111 2222 Why do we procrastinate?
44 222 3333 Close your eyes and push “go”
Aggregate Queries
Including reference counts in the lookup tables allows you to
perform aggregate queries on those tables alone:
Average age of users, most popular location,…
U.id Name Count Attr..
111 Jones 3
222 Smith 1
L.id Name Count Attr…
1111 Berkeley 2
2222 Oakland 1
3333 Hayward 1
Tweet id User id Location id Body
11 111 1111 I need a Jamba Juice
22 111 1111 Cal Soccer rules
33 111 2222 Why do we procrastinate?
44 222 3333 Close your eyes and push “go”
SQL Query Semantics
Semantics of an SQL query are defined in terms of the
following conceptual evaluation strategy:
1. do FROM clause: compute cross-product of tables (e.g.,
Students and Enrolled).
2. do WHERE clause: Check conditions, discard tuples
that fail. (i.e., “selection”).
3. do SELECT clause: Delete unwanted fields. (i.e.,
“projection”).
4. If DISTINCT specified, eliminate duplicate rows.
Probably the least efficient way to compute a query!
– An optimizer will find more efficient strategies to get
the same answer.
38
Data Model (Tabular)
• SQLite
– Table: fixed number of named columns of specified
type
– 5 storage classes for columns
• NULL
• INTEGER
• REAL
• TEXT
• BLOB
– Data stored on disk in a single file in row-major order
– Operations performed via sqlite3 shell
39
Reductions and GroupBy
• One of the most common operations on Data Tables is
aggregation or reduction (count, sum, average, min, max,…).
• They provide a means to see high-level patterns in the data,
to make summaries of it etc.
• You need ways of specifying which columns are being
aggregated over, which is the role of a GroupBy operator.
SID Name Course Semester Grade GPA
111 Jones Stat 134 F13 A 4.0
111 Jones CS 162 F13 B- 2.7
222 Smith EE 141 S14 B+ 3.3
222 Smith CS162 F14 C+ 2.3
222 Smith CS189 F14 A- 3.7
40
Reductions and GroupBy
SID Name Course Semester Grade GPA
111 Jones Stat 134 F13 A 4.0
111 Jones CS 162 F13 B- 2.7
222 Smith EE 141 S14 B+ 3.3
222 Smith CS162 F14 C+ 2.3
222 Smith CS189 F14 A- 3.7
SELECT SID, Name, AVG(GPA)
FROM Students
GROUP BY SID
41
Reductions and GroupBy
SID Name Course Semester Grade GPA
111 Jones Stat 134 F13 A 4.0
111 Jones CS 162 F13 B- 2.7
222 Smith EE 141 S14 B+ 3.3
222 Smith CS162 F14 C+ 2.3
222 Smith CS189 F14 A- 3.7
SELECT SID, Name, AVG(GPA)
FROM Students
GROUP BY SID
SID Name GPA
111 Jones 3.35
222 Smith 3.1
42
Pandas/Python
• Series: a named, ordered dictionary
– The keys of the dictionary are the indexes
– Built on NumPy’s ndarray
– Values can be any Numpy data type object
• DataFrame: a table with named columns
– Represented as a Dict (col_name -> series)
– Each Series object represents a column
43
Operations
• map() functions
• filter (apply predicate to rows)
• sort/group by
• aggregate: sum, count, average, max, min
• Pivot or reshape
• Relational:
– union, intersection, difference, cartesian product
(CROSS JOIN)
– select/filter, project
– join: natural join (INNER JOIN), theta join, semi-join,
etc.
– rename
44
Pandas vs SQL
+ Pandas is lightweight and fast.
+ Full SQL expressiveness plus the expressiveness of
Python, especially for function evaluation.
+ Integration with plotting functions like Matplotlib.
- Tables must fit into memory.
- No post-load indexing functionality: indices are built
when a table is created.
- No transactions, journaling, etc.
- Large, complex joins probably slower.
Jacobs Update
• Room 310 not ready, but other rooms are. For
the next few (?) weeks, we will meet:
• Mondays in 155 Donner
• Wednesdays in 110/120 Jacobs Hall
Starting this Weds.
5 min break
The other view of tables: OLAP
• OnLine Analytical Processing
• Conceptually like an n-dimensional spreadsheet (Cube)
• (Discrete) columns become dimensions
• The goal is live interaction with numerical data for business
intelligence
The other view of tables: OLAP
From a table to a cube:
name classid Semester Grade Units
Jones History105 F13 3.3 4.0
Jones DataScience194 S12 4.0 3.0
Jones French150 F14 3.7 4.0
Smith History105 S15 2.3 3.0
Smith DataScience194 F14 2.7 3.0
Smith French150 F13 3.0 4.0
From tables to OLAP cubes
From a table to a cube:
name classid Semester Grade Units
Jones History105 F13 3.3 4.0
Jones DataScience194 S12 4.0 3.0
Jones French150 F14 3.7 4.0
Smith History105 S15 2.3 3.0
Smith DataScience194 F14 2.7 3.0
Smith French150 F13 3.0 4.0
Variables used as qualifiers
(In where, GroupBy clauses)
Normally discrete
Variables we want to measure
Normally numeric
Constructing OLAP cubes
name classid Semester Grade Units
Jones History105 F13 3.3 4.0
Jones DataScience194 S12 4.0 3.0
… … … … …
Name
Classid
Semester
Cell contents are Grade, Unit values
Cube
dimensions
Cube
values
Queries on OLAP cubes
• Once the cube is defined, its easy to do aggregate queries by
projecting along one or more axes.
• E.g. to get student GPAs, we project the Grade field onto the
student (Name) axis.
• In fact, such aggregates are precomputed and maintained
automatically in an OLAP cube, so queries are instantaneous.
Name
Classid
Semester
Cell contents are Grade, Unit values
OLAP
• Slicing:
fixing one or
more variables
• Dicing:
selecting a range of
values for one or
more variables
OLAP
• Drilling Up/Down
(change levels of a
hierarchically-indexed
variable)
• Pivoting:
produce a two-axis
view for viewing
as a spreadsheet.
Outline
• To support real-time querying, OLAP DBs store aggregates
of data values along many dimensions.
• This works best if axes can be tree-structured. E.g time can
be expressed as a hierarchy
hour  day  week  month  year
OLAP tradeoffs
• Aggregates increase space and the cost of updates.
• On the other hand, since they are projections of data, or
tree structures, the storage overhead can be small.
• Aggregates are limited, but cover a lot of common cases:
avg, stdev, min, max.
• Operations (slice, dice, pivot, etc.) are conceptually simpler
than SQL, but cover a lot of common cases.
• Good integration with clients, e.g. spreadsheets, for visual
interaction, although there is an underlying query
language (MDX).
Numpy/Matlab and OLAP
• Numpy and Matlab have an efficient implementation of nd-
arrays for dense data.
• Indices must be integer, but you can implement general
indices using dictionaries from indexval->int.
• Slicing and dicing are available using index ranges:
a[5,1:3,:] etc.
• Roll-down/up involve aggregates along dimensions such as
sum(a[3,4:6,:],2)
• Pivoting involves index permutations (.transpose()) and
aggregation over the other indices.
• Limitation: MATLAB and Numpy currently only support dense
nd-arrays (or sparse 2d arrays).
What’s Wrong with Tables?
• Too limited in structure?
• Too rigid?
• Too old fashioned?
What’s Wrong with (RDBMS) Tables?
• Indices: Typical RDBMS table storage is mostly indices
– Cant afford this overhead for large datastores
• Transactions:
– Safe state changes require journals etc., and are slow
• Relations:
– Checking relations adds further overhead to updates
• Sparse Data Support:
– RDBMS Tables are very wasteful when data is very sparse
– Very sparse data is common in modern data stores
– RDBMS tables might have dozens of columns, modern data
stores might have many thousands.
RDBMS tables – row based
Table:
Represented as:
sid name login age gpa
53831 Jones jones@cs 18 3.4
53831 Smith smith@ee 18 3.2
53831 Jones jones@cs 18 3.4
53831 Smith smith@ee 18 3.2
Tweet JSON Format
{"id"=>12296272736,
"text"=>
"An early look at Annotations:
http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453",
"created_at"=>"Fri Apr 16 17:55:46 +0000 2010",
"in_reply_to_user_id"=>nil,
"in_reply_to_screen_name"=>nil,
"in_reply_to_status_id"=>nil
"favorited"=>false,
"truncated"=>false,
"user"=>
{"id"=>6253282,
"screen_name"=>"twitterapi",
"name"=>"Twitter API",
"description"=>
"The Real Twitter API. I tweet about API changes, service issues and
happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",
"url"=>"http://apiwiki.twitter.com",
"location"=>"San Francisco, CA",
"profile_background_color"=>"c1dfee",
"profile_background_image_url"=>
"http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",
"profile_background_tile"=>false,
"profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",
"profile_link_color"=>"0000ff",
"profile_sidebar_border_color"=>"87bc44",
"profile_sidebar_fill_color"=>"e0ff92",
"profile_text_color"=>"000000",
"created_at"=>"Wed May 23 06:01:13 +0000 2007",
"contributors_enabled"=>true,
"favourites_count"=>1,
"statuses_count"=>1628,
"friends_count"=>13,
"time_zone"=>"Pacific Time (US & Canada)",
"utc_offset"=>-28800,
"lang"=>"en",
"protected"=>false,
"followers_count"=>100581,
"geo_enabled"=>true,
"notifications"=>false,
"following"=>true,
"verified"=>true},
"contributors"=>[3191321],
"geo"=>nil,
"coordinates"=>nil,
"place"=>
{"id"=>"2b6ff8c22edd9576",
"url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json",
"name"=>"SoMa",
"full_name"=>"SoMa, San Francisco",
"place_type"=>"neighborhood",
"country_code"=>"US",
"country"=>"The United States of America",
"bounding_box"=>
{"coordinates"=>
[[[-122.42284884, 37.76893497],
[-122.3964, 37.76893497],
[-122.3964, 37.78752897],
[-122.42284884, 37.78752897]]],
"type"=>"Polygon"}},
"source"=>"web"}
The tweet's unique ID. These
IDs are roughly sorted &
developers should treat them
as opaque (http://bit.ly/dCkppc).
Text of the tweet.
Consecutive duplicate tweets
are rejected. 140 character
max (http://bit.ly/4ud3he).
Tweet's
creation
date.
DEPRECATED
The ID of an existing tweet that
this tweet is in reply to. Won't
be set unless the author of the
referenced tweet is mentioned.
The screen name &
user ID of replied to
tweet author.
Truncated to 140
characters. Only
possible from SMS.
The
author
of
the
tweet.
This
embedded
object
can
get
out
of
sync.
The
author's
user
ID.
The author's
user name.
The author's
screen name.
The author's
biography.
The author's
URL.
The author's "location". This is a free-form text field, and
there are no guarantees on whether it can be geocoded.
Rendering information
for the author. Colors
are encoded in hex
values (RGB).
The creation date
for this account.
Whether this account has
contributors enabled
(http://bit.ly/50npuu). Number of
favorites this
user has.
Number
of
tweets
this
user
has.
Number of
users this user
is following.
The timezone and offset
(in seconds) for this user.
The user's selected
language.
Whether this user is protected
or not. If the user is protected,
then this tweet is not visible
except to "friends".
Number of
followers for
this user.
Whether
this
user
has
geo
enabled
(http://bit.ly/4pFY77).
DEPRECATED
in this context
Whether this user
has a verified badge.
The
geo
tag
on
this
tweet
in
GeoJSON
(http://bit.ly/b8L1Cp).
The contributors' (if any) user
IDs (http://bit.ly/50npuu).
DEPRECATED
The place associated with this
Tweet (http://bit.ly/b8L1Cp).
The place ID
The URL to fetch a detailed
polygon for this place
The printable names of this place
The type of this
place - can be a
"neighborhood"
or "city"
The country this place is in
The bounding
box for this
place
The application
that sent this
tweet
Map of a Twitter Status Object
Raffi Krikorian <raffi@twitter.com>
18 April 2010
{"id"=>12296272736,
"text"=>
"An early look at Annotations:
http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453",
"created_at"=>"Fri Apr 16 17:55:46 +0000 2010",
"in_reply_to_user_id"=>nil,
"in_reply_to_screen_name"=>nil,
"in_reply_to_status_id"=>nil
"favorited"=>false,
"truncated"=>false,
"user"=>
{"id"=>6253282,
"screen_name"=>"twitterapi",
"name"=>"Twitter API",
"description"=>
"The Real Twitter API. I tweet about API changes, service issues and
happily answer questions about Twitter and our API. Don't get an answer? It's on my website.
The tweet's unique ID. These
IDs are roughly sorted &
developers should treat them
as opaque (http://bit.ly/dCkppc).
Text of the tweet.
Consecutive duplicate tweets
are rejected. 140 character
max (http://bit.ly/4ud3he).
DEPRECATED
The ID of an existing tweet that
this tweet is in reply to. Won't
be set unless the author of the
referenced tweet is mentioned.
The screen name &
user ID of replied to
tweet author.
Truncated to 140
characters. Only
possible from SMS.
.
This
ut
of
sync.
The
author's
user
ID.
The author's
user name.
The author's
screen name.
RDBMS tables – row based
Table:
Represented as:
ID name login loc locid LAT LONG ALT State
52841 Jones jones@cs NULL NULL NULL NULL NULL NULL
53831 Smith smith@ee NULL NULL NULL NULL NULL NULL
55541 Brown brown@ee NULL NULL NULL NULL NULL NULL
53831 Smith smith@ee NULL NULL NULL NULL NULL NULL
52841 Jones jones@cs NULL NULL NULL NULL NULL NULL
55541 Brown brown@ee NULL NULL NULL NULL NULL NULL
Column-based store
Table:
Represented as column (key-value) stores:
ID name login loc locid LAT LONG ALT State
52841 Jones jones@cs Albany 2341 38.4 122.7 100 CA
53831 Smith smith@ee NULL NULL NULL NULL NULL NULL
55541 Brown brown@ee NULL NULL NULL NULL NULL NULL
ID name
52841 Jones
53831 Smith
55541 Brown
ID login
52841 jones@cs
53831 smith@ee
55541 brown@ee
ID loc
52841 Albany
ID locid
52841 2341
ID LAT
52841 38.4
ID LONG
52841 122.7
…
NoSQL Storage Systems
64
Column-Family Stores (Cassandra)
A column-family groups data columns together, and is
analogous to a table (and similar to Pandas DataFrame)
Static column family from Apache Cassandra:
Dynamic Column family (Cassandra):
Can add or
remove columns
from a dynamic
column family
Columns fixed
CouchDB Data Model (JSON)
• “With CouchDB, no schema is enforced, so new document
types with new meaning can be safely added alongside
the old.”
• A CouchDB document is an object that consists of named
fields. Field values may be:
– strings, numbers, dates,
– ordered lists, associative maps
66
"Subject": "I like Plankton"
"Author": "Rusty"
"PostedDate": "5/23/2006"
"Tags": ["plankton", "baseball", "decisions"]
"Body": "I decided today that I don't like baseball. I like plankton."
Key-value stores
• A key-value store is an even simpler approach.
• It implements storage and retrieval of (key,value) pairs.
• i.e. Basic functionality is that of a dictionary
age[“john”] = 25.
• But some KV-stores also implement sorting and
indexing with the keys (e.g. leveldb).
• You can build either column-based or row-based DBs
on top of such KV-stores to optimize performance (e.g.
omitting indices or ACID qualities).
67
Pig
• Started at Yahoo! Research
• Features:
– Expresses sequences of MapReduce jobs
– Data model: nested “bags” of items
• Schema is optional
– Provides relational (SQL) operators
(JOIN, GROUP BY, etc)
– Easy to plug in Java functions
An Example Problem
Suppose you have user
data in one file, website
data in another, and you
need to find the top 5
most visited pages by
users aged 18-25.
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In MapReduce
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Users = load ‘users’ as (name, age);
Filtered = filter Users by
age >= 18 and age <= 25;
Pages = load ‘pages’ as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into ‘top5sites’;
In Pig Latin
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Hive
• Developed at Facebook
• Relational database built on Hadoop
– Maintains table schemas
– SQL-like query language (which can also call
Hadoop Streaming scripts)
– Supports table partitioning,
complex data types, sampling,
some query optimization
• Used for most Facebook jobs
– Less than 1% of daily jobs at Facebook use
MapReduce directly!!! (SQL – or PIG – wins!)
– Note: Google also has several SQL-like systems in use.
Summary
• Two views of tables:
– SQL/Pandas
– OLAP/Numpy/Matlab
• SQL, NoSQL
– Non-Tabular Structures
Wednesday come to 110/120 Jacobs Hall for
Pandas Lab

More Related Content

Similar to F15CS194Lec03TabularData.pptx

Keyword-based Search and Exploration on Databases (SIGMOD 2011)
Keyword-based Search and Exploration on Databases (SIGMOD 2011)Keyword-based Search and Exploration on Databases (SIGMOD 2011)
Keyword-based Search and Exploration on Databases (SIGMOD 2011)weiw_oz
 
RDBMS Arch & Models
RDBMS Arch & ModelsRDBMS Arch & Models
RDBMS Arch & ModelsSarmad Ali
 
Basic IMS For Applications
Basic IMS For ApplicationsBasic IMS For Applications
Basic IMS For ApplicationsDan O'Dea
 
Lecture-2 - Relational Model.pptx
Lecture-2 - Relational Model.pptxLecture-2 - Relational Model.pptx
Lecture-2 - Relational Model.pptxHanzlaNaveed1
 
Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH)
Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH) Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH)
Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH) David Fombella Pombal
 
CSCI 4140-AA Pre-Req.ppt
CSCI 4140-AA Pre-Req.pptCSCI 4140-AA Pre-Req.ppt
CSCI 4140-AA Pre-Req.pptZahidKhan671907
 
L1 Intro to Relational DBMS LP.pdfIntro to Relational .docx
L1 Intro to Relational DBMS LP.pdfIntro to Relational .docxL1 Intro to Relational DBMS LP.pdfIntro to Relational .docx
L1 Intro to Relational DBMS LP.pdfIntro to Relational .docxDIPESH30
 
Query Translation for Ontology-extended Data Sources
Query Translation for Ontology-extended Data SourcesQuery Translation for Ontology-extended Data Sources
Query Translation for Ontology-extended Data SourcesJie Bao
 
An hour with Database and SQL
An hour with Database and SQLAn hour with Database and SQL
An hour with Database and SQLIraj Hedayati
 
[Www.pkbulk.blogspot.com]dbms02
[Www.pkbulk.blogspot.com]dbms02[Www.pkbulk.blogspot.com]dbms02
[Www.pkbulk.blogspot.com]dbms02AnusAhmad
 
Relational_Intro_1.ppt
Relational_Intro_1.pptRelational_Intro_1.ppt
Relational_Intro_1.pptIrfanRashid36
 

Similar to F15CS194Lec03TabularData.pptx (20)

Keyword-based Search and Exploration on Databases (SIGMOD 2011)
Keyword-based Search and Exploration on Databases (SIGMOD 2011)Keyword-based Search and Exploration on Databases (SIGMOD 2011)
Keyword-based Search and Exploration on Databases (SIGMOD 2011)
 
RDBMS Arch & Models
RDBMS Arch & ModelsRDBMS Arch & Models
RDBMS Arch & Models
 
603s129
603s129603s129
603s129
 
Basic IMS For Applications
Basic IMS For ApplicationsBasic IMS For Applications
Basic IMS For Applications
 
Lecture-2 - Relational Model.pptx
Lecture-2 - Relational Model.pptxLecture-2 - Relational Model.pptx
Lecture-2 - Relational Model.pptx
 
Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH)
Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH) Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH)
Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH)
 
database Normalization
database Normalizationdatabase Normalization
database Normalization
 
RDBMS Model
RDBMS ModelRDBMS Model
RDBMS Model
 
CSCI 4140-AA Pre-Req.ppt
CSCI 4140-AA Pre-Req.pptCSCI 4140-AA Pre-Req.ppt
CSCI 4140-AA Pre-Req.ppt
 
L1 Intro to Relational DBMS LP.pdfIntro to Relational .docx
L1 Intro to Relational DBMS LP.pdfIntro to Relational .docxL1 Intro to Relational DBMS LP.pdfIntro to Relational .docx
L1 Intro to Relational DBMS LP.pdfIntro to Relational .docx
 
RelationalDBs3.ppt
RelationalDBs3.pptRelationalDBs3.ppt
RelationalDBs3.ppt
 
My sql
My sqlMy sql
My sql
 
Query Translation for Ontology-extended Data Sources
Query Translation for Ontology-extended Data SourcesQuery Translation for Ontology-extended Data Sources
Query Translation for Ontology-extended Data Sources
 
R Algebra.ppt
R Algebra.pptR Algebra.ppt
R Algebra.ppt
 
DBMS
DBMSDBMS
DBMS
 
Interaction
InteractionInteraction
Interaction
 
An hour with Database and SQL
An hour with Database and SQLAn hour with Database and SQL
An hour with Database and SQL
 
[Www.pkbulk.blogspot.com]dbms02
[Www.pkbulk.blogspot.com]dbms02[Www.pkbulk.blogspot.com]dbms02
[Www.pkbulk.blogspot.com]dbms02
 
Lecture2-DT.pptx
Lecture2-DT.pptxLecture2-DT.pptx
Lecture2-DT.pptx
 
Relational_Intro_1.ppt
Relational_Intro_1.pptRelational_Intro_1.ppt
Relational_Intro_1.ppt
 

Recently uploaded

Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 

Recently uploaded (20)

Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 

F15CS194Lec03TabularData.pptx

  • 1. Introduction to Data Science Lecture 3 Manipulating Tabular Data Intro. to Data Science Fall 2015 John Canny including notes from Michael Franklin and others
  • 2. Outline for this Evening • Two views of tables: – SQL/Pandas – OLAP/Numpy/Matlab • SQL, NoSQL – Non-Tabular Structures
  • 3. Data Science – One Definition
  • 5. Two views of tables
  • 7. Key Concept: Structured Data A data model is a collection of concepts for describing data. A schema is a description of a particular collection of data, using a given data model.
  • 8. The Relational Model* • The Relational Model is Ubiquitous: • MySQL, PostgreSQL, Oracle, DB2, SQLServer, … • Foundational work done at • IBM - System R • UC Berkeley - Ingres • Object-oriented concepts have been merged in • Early work: POSTGRES research project at Berkeley • Informix, IBM DB2, Oracle 8i • Also has support for XML (semi-structured data) *Codd, E. F. (1970). "A relational model of data for large shared data banks". Communications of the ACM 13 (6): 37 E. F., “Ted” Codd Turing Award 1981
  • 9. Relational Database: Definitions • Relational database: a set of relations • Relation: made up of 2 parts: Schema : specifies name of relation, plus name and type of each column Students(sid: string, name: string, login: string, age: integer, gpa: real) Instance : the actual data at a given time • #rows = cardinality • #fields = degree / arity • A relation is a mathematical object (from set theory) which is true for certain arguments. • An instance defines the set of arguments for which the relation is true (it’s a table not a row).
  • 10. Ex: Instance of Students Relation sid name login age gpa 53666 Jones jones@cs 18 3.4 53688 Smith smith@eecs 18 3.2 53650 Smith smith @math 19 3.8 • Cardinality = 3, arity = 5 , all rows distinct • The relation is true for these tuples and false for others
  • 11. SQL - A language for Relational DBs* • SQL = Structured Query Language • Data Definition Language (DDL) – create, modify, delete relations – specify constraints – administer users, security, etc. • Data Manipulation Language (DML) – Specify queries to find tuples that satisfy criteria – add, modify, remove tuples • The DBMS is responsible for efficient evaluation. * Developed at IBM by Donald D. Chamberlin and Raymond F. Boyce in the 1970s. Used to be SEQUEL (Structured English QUEry Language)
  • 12. Creating Relations in SQL • Create the Students relation. – Note: the type (domain) of each field is specified, and enforced by the DBMS whenever tuples are added or modified. CREATE TABLE Students (sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT)
  • 13. Table Creation (continued) • Another example: the Enrolled table holds information about courses students take. CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2))
  • 14. Adding and Deleting Tuples • Can insert a single tuple using: INSERT INTO Students (sid, name, login, age, gpa) VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2) • Can delete all tuples satisfying some condition (e.g., name = Smith): DELETE FROM Students S WHERE S.name = 'Smith'
  • 15. Queries in SQL • Single-table queries are straightforward. • To find all 18 year old students, we can write: SELECT * FROM Students S WHERE S.age=18 • To find just names and logins, replace the first line: SELECT S.name, S.login
  • 16. Joins and Inference • Chaining relations together is the basic inference method in relational DBs. It produces new relations (effectively new facts) from the data: SELECT S.name, M.mortality FROM Students S, Mortality M WHERE S.Race=M.Race Name Race Socrates Man Thor God Barney Dinosaur Blarney stone Stone Race Mortality Man Mortal God Immortal Dinosaur Mortal Stone Non-living M S
  • 17. Joins and Inference • Chaining relations together is the basic inference method in relational DBs. It produces new relations (effectively new facts) from the data: SELECT S.name, M.mortality FROM Students S, Mortality M WHERE S.Race=M.Race Name Mortality Socrates Mortal Thor Immortal Barney Mortal Blarney stone Non-living
  • 18. Basic SQL Query • relation-list : A list of relation names • possibly with a range-variable after each name • target-list : A list of attributes of tables in relation-list • qualification : Comparisons combined using AND, OR and NOT. • Comparisons are Attr op const or Attr1 op Attr2, where op is one of =≠<>≤≥ • DISTINCT: optional keyword indicating that the answer should not contain duplicates. • In SQL SELECT, the default is that duplicates are not eliminated! (Result is called a “multiset”) SELECT [DISTINCT] target-list FROM relation-list WHERE qualification
  • 19. SQL Inner Joins SELECT S.name, E.classid FROM Students S (INNER) JOIN Enrolled E ON S.sid=E.sid S.name S.sid Jones 11111 Smith 22222 Brown 33333 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 E S Note the previous version of this query (with no join keyword) is an “Implicit join”
  • 20. SQL Inner Joins SELECT S.name, E.classid FROM Students S (INNER) JOIN Enrolled E ON S.sid=E.sid S.name S.sid Jones 11111 Smith 22222 Brown 33333 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 E S Unmatched keys
  • 21. What kind of Join is this? SELECT S.name, E.classid FROM Students S ?? Enrolled E ON S.sid=E.sid S.name S.sid Jones 11111 Smith 22222 Brown 33333 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 Brown NULL E S
  • 22. SQL Joins SELECT S.name, E.classid FROM Students S LEFT OUTER JOIN Enrolled E ON S.sid=E.sid S.name S.sid Jones 11111 Smith 22222 Brown 33333 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 Brown NULL E S
  • 23. What kind of Join is this? SELECT S.name, E.classid FROM Students S ?? Enrolled E ON S.sid=E.sid S.name S.sid Jones 11111 Smith 22222 Brown 33333 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 NULL English10 E S
  • 24. SQL Joins SELECT S.name, E.classid FROM Students S RIGHT OUTER JOIN Enrolled E ON S.sid=E.sid S.name S.sid Jones 11111 Smith 22222 Brown 33333 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 NULL English10 E S
  • 25. SQL Joins SELECT S.name, E.classid FROM Students S ? JOIN Enrolled E ON S.sid=E.sid S.name S.sid Jones 11111 Smith 22222 Brown 33333 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 NULL English10 Brown NULL E S
  • 26. SQL Joins SELECT S.name, E.classid FROM Students S FULL OUTER JOIN Enrolled E ON S.sid=E.sid S.name S.sid Jones 11111 Smith 22222 Brown 33333 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 NULL English10 Brown NULL E S
  • 27. What kind of Join is this? SELECT S.name, E.classid FROM Students S ?? Enrolled E ON S.sid=E.sid S.name S.sid Jones 11111 Smith 22222 Brown 33333 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 44444 English10 S.name E.classid Jones History105 Smith French150 E S
  • 28. SQL Joins SELECT S.name, E.classid FROM Students S LEFT SEMI JOIN Enrolled E ON S.sid=E.sid S.name S.sid Jones 11111 Smith 22222 Brown 33333 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 44444 English10 S.name E.classid Jones History105 Smith French150 E S
  • 29. What kind of Join is this? SELECT * FROM Students S ?? Enrolled E S.name S.sid Jones 11111 Smith 22222 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 S.name S.sid E.sid E.classid Jones 11111 11111 History105 Jones 11111 11111 DataScience194 Jones 11111 22222 French150 Smith 22222 11111 History105 Smith 22222 11111 DataScience194 Smith 22222 22222 French150 E S
  • 30. SQL Joins SELECT * FROM Students S CROSS JOIN Enrolled E S.name S.sid Jones 11111 Smith 22222 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 S.name S.sid E.sid E.classid Jones 11111 11111 History105 Jones 11111 11111 DataScience194 Jones 11111 22222 French150 Smith 22222 11111 History105 Smith 22222 11111 DataScience194 Smith 22222 22222 French150 E S
  • 31. What kind of Join is this? SELECT * FROM Students S, Enrolled E WHERE S.sid <= E.sid S.name S.sid Jones 11111 Smith 22222 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 S.name S.sid E.sid E.classid Jones 11111 11111 History105 Jones 11111 11111 DataScience194 Jones 11111 22222 French150 Smith 22222 22222 French150 E S
  • 32. Theta Joins SELECT * FROM Students S, Enrolled E WHERE S.sid <= E.sid S.name S.sid Jones 11111 Smith 22222 E.sid E.classid 11111 History105 11111 DataScience194 22222 French150 S.name S.sid E.sid E.classid Jones 11111 11111 History105 Jones 11111 11111 DataScience194 Jones 11111 22222 French150 Smith 22222 22222 French150 E S
  • 33. Recall: Tweet JSON Format {"id"=>12296272736, "text"=> "An early look at Annotations: http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453", "created_at"=>"Fri Apr 16 17:55:46 +0000 2010", "in_reply_to_user_id"=>nil, "in_reply_to_screen_name"=>nil, "in_reply_to_status_id"=>nil "favorited"=>false, "truncated"=>false, "user"=> {"id"=>6253282, "screen_name"=>"twitterapi", "name"=>"Twitter API", "description"=> "The Real Twitter API. I tweet about API changes, service issues and happily answer questions about Twitter and our API. Don't get an answer? It's on my website.", "url"=>"http://apiwiki.twitter.com", "location"=>"San Francisco, CA", "profile_background_color"=>"c1dfee", "profile_background_image_url"=> "http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png", "profile_background_tile"=>false, "profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png", "profile_link_color"=>"0000ff", "profile_sidebar_border_color"=>"87bc44", "profile_sidebar_fill_color"=>"e0ff92", "profile_text_color"=>"000000", "created_at"=>"Wed May 23 06:01:13 +0000 2007", "contributors_enabled"=>true, "favourites_count"=>1, "statuses_count"=>1628, "friends_count"=>13, "time_zone"=>"Pacific Time (US & Canada)", "utc_offset"=>-28800, "lang"=>"en", "protected"=>false, "followers_count"=>100581, "geo_enabled"=>true, "notifications"=>false, "following"=>true, "verified"=>true}, "contributors"=>[3191321], "geo"=>nil, "coordinates"=>nil, "place"=> {"id"=>"2b6ff8c22edd9576", "url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json", "name"=>"SoMa", "full_name"=>"SoMa, San Francisco", "place_type"=>"neighborhood", "country_code"=>"US", "country"=>"The United States of America", "bounding_box"=> {"coordinates"=> [[[-122.42284884, 37.76893497], [-122.3964, 37.76893497], [-122.3964, 37.78752897], [-122.42284884, 37.78752897]]], "type"=>"Polygon"}}, "source"=>"web"} The tweet's unique ID. These IDs are roughly sorted & developers should treat them as opaque (http://bit.ly/dCkppc). Text of the tweet. Consecutive duplicate tweets are rejected. 140 character max (http://bit.ly/4ud3he). Tweet's creation date. DEPRECATED The ID of an existing tweet that this tweet is in reply to. Won't be set unless the author of the referenced tweet is mentioned. The screen name & user ID of replied to tweet author. Truncated to 140 characters. Only possible from SMS. The author of the tweet. This embedded object can get out of sync. The author's user ID. The author's user name. The author's screen name. The author's biography. The author's URL. The author's "location". This is a free-form text field, and there are no guarantees on whether it can be geocoded. Rendering information for the author. Colors are encoded in hex values (RGB). The creation date for this account. Whether this account has contributors enabled (http://bit.ly/50npuu). Number of favorites this user has. Number of tweets this user has. Number of users this user is following. The timezone and offset (in seconds) for this user. The user's selected language. Whether this user is protected or not. If the user is protected, then this tweet is not visible except to "friends". Number of followers for this user. Whether this user has geo enabled (http://bit.ly/4pFY77). DEPRECATED in this context Whether this user has a verified badge. The geo tag on this tweet in GeoJSON (http://bit.ly/b8L1Cp). The contributors' (if any) user IDs (http://bit.ly/50npuu). DEPRECATED The place associated with this Tweet (http://bit.ly/b8L1Cp). The place ID The URL to fetch a detailed polygon for this place The printable names of this place The type of this place - can be a "neighborhood" or "city" The country this place is in The bounding box for this place The application that sent this tweet Map of a Twitter Status Object Raffi Krikorian <raffi@twitter.com> 18 April 2010 {"id"=>12296272736, "text"=> "An early look at Annotations: http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453", "created_at"=>"Fri Apr 16 17:55:46 +0000 2010", "in_reply_to_user_id"=>nil, "in_reply_to_screen_name"=>nil, "in_reply_to_status_id"=>nil "favorited"=>false, "truncated"=>false, "user"=> {"id"=>6253282, "screen_name"=>"twitterapi", "name"=>"Twitter API", "description"=> "The Real Twitter API. I tweet about API changes, service issues and happily answer questions about Twitter and our API. Don't get an answer? It's on my website. The tweet's unique ID. These IDs are roughly sorted & developers should treat them as opaque (http://bit.ly/dCkppc). Text of the tweet. Consecutive duplicate tweets are rejected. 140 character max (http://bit.ly/4ud3he). DEPRECATED The ID of an existing tweet that this tweet is in reply to. Won't be set unless the author of the referenced tweet is mentioned. The screen name & user ID of replied to tweet author. Truncated to 140 characters. Only possible from SMS. . This ut of sync. The author's user ID. The author's user name. The author's screen name.
  • 34. Normalization Raw twitter data storage is very inefficient because, e.g. user records are repeated with every tweet by that user. Normalization is the process of minimizing data redundancy. User.id Name Attribs… 111 Jones 222 Smith Loc.id Name Attribs… 1111 Berkeley 2222 Oakland 3333 Hayward Tweet id User id Location id Body 11 111 1111 I need a Jamba juice 22 111 1111 Cal Soccer rules 33 111 2222 Why do we procrastinate? 44 222 3333 Close your eyes and push “go”
  • 35. Normalization Normalized tables include only a foreign key to the information in another table for repeated data. The original table is the result of inner joins between tables. User.id Name Attribs… 111 Jones 222 Smith Loc.id Name Attribs… 1111 Berkeley 2222 Oakland 3333 Hayward Tweet id User id Location id Body 11 111 1111 I need a Jamba juice 22 111 1111 Cal Soccer rules 33 111 2222 Why do we procrastinate? 44 222 3333 Close your eyes and push “go”
  • 36. Aggregate Queries Including reference counts in the lookup tables allows you to perform aggregate queries on those tables alone: Average age of users, most popular location,… U.id Name Count Attr.. 111 Jones 3 222 Smith 1 L.id Name Count Attr… 1111 Berkeley 2 2222 Oakland 1 3333 Hayward 1 Tweet id User id Location id Body 11 111 1111 I need a Jamba Juice 22 111 1111 Cal Soccer rules 33 111 2222 Why do we procrastinate? 44 222 3333 Close your eyes and push “go”
  • 37. SQL Query Semantics Semantics of an SQL query are defined in terms of the following conceptual evaluation strategy: 1. do FROM clause: compute cross-product of tables (e.g., Students and Enrolled). 2. do WHERE clause: Check conditions, discard tuples that fail. (i.e., “selection”). 3. do SELECT clause: Delete unwanted fields. (i.e., “projection”). 4. If DISTINCT specified, eliminate duplicate rows. Probably the least efficient way to compute a query! – An optimizer will find more efficient strategies to get the same answer.
  • 38. 38 Data Model (Tabular) • SQLite – Table: fixed number of named columns of specified type – 5 storage classes for columns • NULL • INTEGER • REAL • TEXT • BLOB – Data stored on disk in a single file in row-major order – Operations performed via sqlite3 shell
  • 39. 39 Reductions and GroupBy • One of the most common operations on Data Tables is aggregation or reduction (count, sum, average, min, max,…). • They provide a means to see high-level patterns in the data, to make summaries of it etc. • You need ways of specifying which columns are being aggregated over, which is the role of a GroupBy operator. SID Name Course Semester Grade GPA 111 Jones Stat 134 F13 A 4.0 111 Jones CS 162 F13 B- 2.7 222 Smith EE 141 S14 B+ 3.3 222 Smith CS162 F14 C+ 2.3 222 Smith CS189 F14 A- 3.7
  • 40. 40 Reductions and GroupBy SID Name Course Semester Grade GPA 111 Jones Stat 134 F13 A 4.0 111 Jones CS 162 F13 B- 2.7 222 Smith EE 141 S14 B+ 3.3 222 Smith CS162 F14 C+ 2.3 222 Smith CS189 F14 A- 3.7 SELECT SID, Name, AVG(GPA) FROM Students GROUP BY SID
  • 41. 41 Reductions and GroupBy SID Name Course Semester Grade GPA 111 Jones Stat 134 F13 A 4.0 111 Jones CS 162 F13 B- 2.7 222 Smith EE 141 S14 B+ 3.3 222 Smith CS162 F14 C+ 2.3 222 Smith CS189 F14 A- 3.7 SELECT SID, Name, AVG(GPA) FROM Students GROUP BY SID SID Name GPA 111 Jones 3.35 222 Smith 3.1
  • 42. 42 Pandas/Python • Series: a named, ordered dictionary – The keys of the dictionary are the indexes – Built on NumPy’s ndarray – Values can be any Numpy data type object • DataFrame: a table with named columns – Represented as a Dict (col_name -> series) – Each Series object represents a column
  • 43. 43 Operations • map() functions • filter (apply predicate to rows) • sort/group by • aggregate: sum, count, average, max, min • Pivot or reshape • Relational: – union, intersection, difference, cartesian product (CROSS JOIN) – select/filter, project – join: natural join (INNER JOIN), theta join, semi-join, etc. – rename
  • 44. 44 Pandas vs SQL + Pandas is lightweight and fast. + Full SQL expressiveness plus the expressiveness of Python, especially for function evaluation. + Integration with plotting functions like Matplotlib. - Tables must fit into memory. - No post-load indexing functionality: indices are built when a table is created. - No transactions, journaling, etc. - Large, complex joins probably slower.
  • 45. Jacobs Update • Room 310 not ready, but other rooms are. For the next few (?) weeks, we will meet: • Mondays in 155 Donner • Wednesdays in 110/120 Jacobs Hall Starting this Weds.
  • 47. The other view of tables: OLAP • OnLine Analytical Processing • Conceptually like an n-dimensional spreadsheet (Cube) • (Discrete) columns become dimensions • The goal is live interaction with numerical data for business intelligence
  • 48. The other view of tables: OLAP From a table to a cube: name classid Semester Grade Units Jones History105 F13 3.3 4.0 Jones DataScience194 S12 4.0 3.0 Jones French150 F14 3.7 4.0 Smith History105 S15 2.3 3.0 Smith DataScience194 F14 2.7 3.0 Smith French150 F13 3.0 4.0
  • 49. From tables to OLAP cubes From a table to a cube: name classid Semester Grade Units Jones History105 F13 3.3 4.0 Jones DataScience194 S12 4.0 3.0 Jones French150 F14 3.7 4.0 Smith History105 S15 2.3 3.0 Smith DataScience194 F14 2.7 3.0 Smith French150 F13 3.0 4.0 Variables used as qualifiers (In where, GroupBy clauses) Normally discrete Variables we want to measure Normally numeric
  • 50. Constructing OLAP cubes name classid Semester Grade Units Jones History105 F13 3.3 4.0 Jones DataScience194 S12 4.0 3.0 … … … … … Name Classid Semester Cell contents are Grade, Unit values Cube dimensions Cube values
  • 51. Queries on OLAP cubes • Once the cube is defined, its easy to do aggregate queries by projecting along one or more axes. • E.g. to get student GPAs, we project the Grade field onto the student (Name) axis. • In fact, such aggregates are precomputed and maintained automatically in an OLAP cube, so queries are instantaneous. Name Classid Semester Cell contents are Grade, Unit values
  • 52. OLAP • Slicing: fixing one or more variables • Dicing: selecting a range of values for one or more variables
  • 53. OLAP • Drilling Up/Down (change levels of a hierarchically-indexed variable) • Pivoting: produce a two-axis view for viewing as a spreadsheet.
  • 54. Outline • To support real-time querying, OLAP DBs store aggregates of data values along many dimensions. • This works best if axes can be tree-structured. E.g time can be expressed as a hierarchy hour  day  week  month  year
  • 55. OLAP tradeoffs • Aggregates increase space and the cost of updates. • On the other hand, since they are projections of data, or tree structures, the storage overhead can be small. • Aggregates are limited, but cover a lot of common cases: avg, stdev, min, max. • Operations (slice, dice, pivot, etc.) are conceptually simpler than SQL, but cover a lot of common cases. • Good integration with clients, e.g. spreadsheets, for visual interaction, although there is an underlying query language (MDX).
  • 56. Numpy/Matlab and OLAP • Numpy and Matlab have an efficient implementation of nd- arrays for dense data. • Indices must be integer, but you can implement general indices using dictionaries from indexval->int. • Slicing and dicing are available using index ranges: a[5,1:3,:] etc. • Roll-down/up involve aggregates along dimensions such as sum(a[3,4:6,:],2) • Pivoting involves index permutations (.transpose()) and aggregation over the other indices. • Limitation: MATLAB and Numpy currently only support dense nd-arrays (or sparse 2d arrays).
  • 57. What’s Wrong with Tables? • Too limited in structure? • Too rigid? • Too old fashioned?
  • 58. What’s Wrong with (RDBMS) Tables? • Indices: Typical RDBMS table storage is mostly indices – Cant afford this overhead for large datastores • Transactions: – Safe state changes require journals etc., and are slow • Relations: – Checking relations adds further overhead to updates • Sparse Data Support: – RDBMS Tables are very wasteful when data is very sparse – Very sparse data is common in modern data stores – RDBMS tables might have dozens of columns, modern data stores might have many thousands.
  • 59. RDBMS tables – row based Table: Represented as: sid name login age gpa 53831 Jones jones@cs 18 3.4 53831 Smith smith@ee 18 3.2 53831 Jones jones@cs 18 3.4 53831 Smith smith@ee 18 3.2
  • 60. Tweet JSON Format {"id"=>12296272736, "text"=> "An early look at Annotations: http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453", "created_at"=>"Fri Apr 16 17:55:46 +0000 2010", "in_reply_to_user_id"=>nil, "in_reply_to_screen_name"=>nil, "in_reply_to_status_id"=>nil "favorited"=>false, "truncated"=>false, "user"=> {"id"=>6253282, "screen_name"=>"twitterapi", "name"=>"Twitter API", "description"=> "The Real Twitter API. I tweet about API changes, service issues and happily answer questions about Twitter and our API. Don't get an answer? It's on my website.", "url"=>"http://apiwiki.twitter.com", "location"=>"San Francisco, CA", "profile_background_color"=>"c1dfee", "profile_background_image_url"=> "http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png", "profile_background_tile"=>false, "profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png", "profile_link_color"=>"0000ff", "profile_sidebar_border_color"=>"87bc44", "profile_sidebar_fill_color"=>"e0ff92", "profile_text_color"=>"000000", "created_at"=>"Wed May 23 06:01:13 +0000 2007", "contributors_enabled"=>true, "favourites_count"=>1, "statuses_count"=>1628, "friends_count"=>13, "time_zone"=>"Pacific Time (US & Canada)", "utc_offset"=>-28800, "lang"=>"en", "protected"=>false, "followers_count"=>100581, "geo_enabled"=>true, "notifications"=>false, "following"=>true, "verified"=>true}, "contributors"=>[3191321], "geo"=>nil, "coordinates"=>nil, "place"=> {"id"=>"2b6ff8c22edd9576", "url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json", "name"=>"SoMa", "full_name"=>"SoMa, San Francisco", "place_type"=>"neighborhood", "country_code"=>"US", "country"=>"The United States of America", "bounding_box"=> {"coordinates"=> [[[-122.42284884, 37.76893497], [-122.3964, 37.76893497], [-122.3964, 37.78752897], [-122.42284884, 37.78752897]]], "type"=>"Polygon"}}, "source"=>"web"} The tweet's unique ID. These IDs are roughly sorted & developers should treat them as opaque (http://bit.ly/dCkppc). Text of the tweet. Consecutive duplicate tweets are rejected. 140 character max (http://bit.ly/4ud3he). Tweet's creation date. DEPRECATED The ID of an existing tweet that this tweet is in reply to. Won't be set unless the author of the referenced tweet is mentioned. The screen name & user ID of replied to tweet author. Truncated to 140 characters. Only possible from SMS. The author of the tweet. This embedded object can get out of sync. The author's user ID. The author's user name. The author's screen name. The author's biography. The author's URL. The author's "location". This is a free-form text field, and there are no guarantees on whether it can be geocoded. Rendering information for the author. Colors are encoded in hex values (RGB). The creation date for this account. Whether this account has contributors enabled (http://bit.ly/50npuu). Number of favorites this user has. Number of tweets this user has. Number of users this user is following. The timezone and offset (in seconds) for this user. The user's selected language. Whether this user is protected or not. If the user is protected, then this tweet is not visible except to "friends". Number of followers for this user. Whether this user has geo enabled (http://bit.ly/4pFY77). DEPRECATED in this context Whether this user has a verified badge. The geo tag on this tweet in GeoJSON (http://bit.ly/b8L1Cp). The contributors' (if any) user IDs (http://bit.ly/50npuu). DEPRECATED The place associated with this Tweet (http://bit.ly/b8L1Cp). The place ID The URL to fetch a detailed polygon for this place The printable names of this place The type of this place - can be a "neighborhood" or "city" The country this place is in The bounding box for this place The application that sent this tweet Map of a Twitter Status Object Raffi Krikorian <raffi@twitter.com> 18 April 2010 {"id"=>12296272736, "text"=> "An early look at Annotations: http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453", "created_at"=>"Fri Apr 16 17:55:46 +0000 2010", "in_reply_to_user_id"=>nil, "in_reply_to_screen_name"=>nil, "in_reply_to_status_id"=>nil "favorited"=>false, "truncated"=>false, "user"=> {"id"=>6253282, "screen_name"=>"twitterapi", "name"=>"Twitter API", "description"=> "The Real Twitter API. I tweet about API changes, service issues and happily answer questions about Twitter and our API. Don't get an answer? It's on my website. The tweet's unique ID. These IDs are roughly sorted & developers should treat them as opaque (http://bit.ly/dCkppc). Text of the tweet. Consecutive duplicate tweets are rejected. 140 character max (http://bit.ly/4ud3he). DEPRECATED The ID of an existing tweet that this tweet is in reply to. Won't be set unless the author of the referenced tweet is mentioned. The screen name & user ID of replied to tweet author. Truncated to 140 characters. Only possible from SMS. . This ut of sync. The author's user ID. The author's user name. The author's screen name.
  • 61. RDBMS tables – row based Table: Represented as: ID name login loc locid LAT LONG ALT State 52841 Jones jones@cs NULL NULL NULL NULL NULL NULL 53831 Smith smith@ee NULL NULL NULL NULL NULL NULL 55541 Brown brown@ee NULL NULL NULL NULL NULL NULL 53831 Smith smith@ee NULL NULL NULL NULL NULL NULL 52841 Jones jones@cs NULL NULL NULL NULL NULL NULL 55541 Brown brown@ee NULL NULL NULL NULL NULL NULL
  • 62. Column-based store Table: Represented as column (key-value) stores: ID name login loc locid LAT LONG ALT State 52841 Jones jones@cs Albany 2341 38.4 122.7 100 CA 53831 Smith smith@ee NULL NULL NULL NULL NULL NULL 55541 Brown brown@ee NULL NULL NULL NULL NULL NULL ID name 52841 Jones 53831 Smith 55541 Brown ID login 52841 jones@cs 53831 smith@ee 55541 brown@ee ID loc 52841 Albany ID locid 52841 2341 ID LAT 52841 38.4 ID LONG 52841 122.7 …
  • 64. Column-Family Stores (Cassandra) A column-family groups data columns together, and is analogous to a table (and similar to Pandas DataFrame) Static column family from Apache Cassandra: Dynamic Column family (Cassandra): Can add or remove columns from a dynamic column family Columns fixed
  • 65. CouchDB Data Model (JSON) • “With CouchDB, no schema is enforced, so new document types with new meaning can be safely added alongside the old.” • A CouchDB document is an object that consists of named fields. Field values may be: – strings, numbers, dates, – ordered lists, associative maps 66 "Subject": "I like Plankton" "Author": "Rusty" "PostedDate": "5/23/2006" "Tags": ["plankton", "baseball", "decisions"] "Body": "I decided today that I don't like baseball. I like plankton."
  • 66. Key-value stores • A key-value store is an even simpler approach. • It implements storage and retrieval of (key,value) pairs. • i.e. Basic functionality is that of a dictionary age[“john”] = 25. • But some KV-stores also implement sorting and indexing with the keys (e.g. leveldb). • You can build either column-based or row-based DBs on top of such KV-stores to optimize performance (e.g. omitting indices or ACID qualities). 67
  • 67. Pig • Started at Yahoo! Research • Features: – Expresses sequences of MapReduce jobs – Data model: nested “bags” of items • Schema is optional – Provides relational (SQL) operators (JOIN, GROUP BY, etc) – Easy to plug in Java functions
  • 68. An Example Problem Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25. Load Users Load Pages Filter by age Join on name Group on url Count clicks Order by clicks Take top 5 Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
  • 69. In MapReduce Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
  • 70. Users = load ‘users’ as (name, age); Filtered = filter Users by age >= 18 and age <= 25; Pages = load ‘pages’ as (user, url); Joined = join Filtered by name, Pages by user; Grouped = group Joined by url; Summed = foreach Grouped generate group, count(Joined) as clicks; Sorted = order Summed by clicks desc; Top5 = limit Sorted 5; store Top5 into ‘top5sites’; In Pig Latin Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
  • 71. Hive • Developed at Facebook • Relational database built on Hadoop – Maintains table schemas – SQL-like query language (which can also call Hadoop Streaming scripts) – Supports table partitioning, complex data types, sampling, some query optimization • Used for most Facebook jobs – Less than 1% of daily jobs at Facebook use MapReduce directly!!! (SQL – or PIG – wins!) – Note: Google also has several SQL-like systems in use.
  • 72. Summary • Two views of tables: – SQL/Pandas – OLAP/Numpy/Matlab • SQL, NoSQL – Non-Tabular Structures Wednesday come to 110/120 Jacobs Hall for Pandas Lab