select
    dateint,
    hour,
    count(distinct other_properties['CUSTOMER_ID']) as signups
from default.start_membership_event
where other_properties['SIGNUP_COUNTRY'] = 'NL'
  and other_properties['IS_TESTER'] = 'false'
  and (dateint >= 20130911 or (dateint = 20130910 and hour >= 22))
  and (dateint <= 20130916 or (dateint = 20130917 and hour < 22))
group by
    dateint,
    hour
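The two OR-ed date predicates above define one window at hour precision: from 2013-09-10 22:00 up to, but not including, 2013-09-17 22:00. A minimal Python sketch of that boundary check (the `in_window` helper is hypothetical, not from the deck) shows the trick of combining `dateint` and `hour` into one sortable value:

```python
def in_window(dateint, hour):
    """Return True if (dateint, hour) falls in the reporting window
    from 2013-09-10 22:00 up to (but not including) 2013-09-17 22:00,
    mirroring the two OR-ed predicates in the Hive query."""
    ts = dateint * 100 + hour  # e.g. 20130910, 22 -> 2013091022
    return 2013091022 <= ts < 2013091722
```

Hive has no timestamp column here, so the query spells the same bounds out as two OR-ed predicates instead.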
Editor's Notes
Hi, my name is Albert Wong
I am in charge of the reporting platform at Netflix
And today, we'll take a look at how we've set up EMR,
Where it's used in our data platform,
And how MicroStrategy plugs into this architecture
Lastly, we'll end with a demo illustrating how we get data from EMR to MSTR
If you are new to Netflix, we are a TV/movie content streaming business
When you log in, we display content you can watch on demand
If you're a kid, we provide a section for kids' content next to 'watch instantly' at the top
And, we also have a DVD service
Now, clicking on one of the shows below does 2 things
First, you get to watch the show
Second, we, at the data platform, get data to analyze
Simplified view of our data pipeline
Full view of our data platform
Complicated at first glance, but easy to understand when we take the time to break it down
Let’s break it down
Data platform when we were mainly a DVD company
Infinite hard drive
Low cost to store our data reliably
S3 stands for Simple Storage Service
So what do we store?
Our event data pipeline
We get streaming events (e.g. when you start a movie/show, stop, pause, resume)
Honu is a branch of a similar technology called Chukwa, conceived at Yahoo!, then modified to meet our needs
Where we store our dimension data (e.g. titles, user accounts)
An open-source, distributed DBMS
Reads quickly, writes quickly
And we’ve replaced Oracle with it
Review
Keep in mind, we eventually want this data to be reported out of MSTR
Needed something to process our increased data volumes
Hadoop fit the bill
It’s a framework for processing large data sets
A full discussion of how Hadoop works is out of scope for today
But one thing to highlight is that it’s designed to scale: if we need to process more data, we just add more servers
And EMR allows us to add more servers with relative ease
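Why adding servers scales the job can be sketched with a toy map/shuffle/reduce in Python. This is a simplified model of the programming idea, not Hadoop's actual API; the function names are illustrative:

```python
from collections import defaultdict

def map_phase(records):
    # Each mapper emits (key, value) pairs; here, one count per event country.
    for rec in records:
        yield rec["country"], 1

def shuffle(pairs):
    # The framework groups all values by key before reducing.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reducer aggregates one key's values independently of the others,
    # so keys (and the servers handling them) can be split across machines.
    return {key: sum(values) for key, values in groups.items()}

events = [{"country": "NL"}, {"country": "US"}, {"country": "NL"}]
counts = reduce_phase(shuffle(map_phase(events)))  # {'NL': 2, 'US': 1}
```

Because mappers and reducers only see their own slice of the data, more input just means more parallel slices.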
Pig is an interface that makes it easy to write code that runs within the Hadoop framework to extract data
Python is a high level programming language that we use to aid in data transformation
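As a hedged illustration of the kind of transformation step Python handles, here is a sketch that normalizes one raw event line before aggregation. The field names follow the query at the top of these notes, but the exact schema and the `clean_event` helper are assumptions for illustration:

```python
import json

def clean_event(raw_line):
    """Parse one raw JSON event line and normalize the fields we
    aggregate on downstream (hypothetical schema, for illustration)."""
    event = json.loads(raw_line)
    props = event.get("other_properties", {})
    return {
        "dateint": int(event["dateint"]),
        "hour": int(event["hour"]),
        "customer_id": props.get("CUSTOMER_ID"),
        "is_tester": props.get("IS_TESTER", "false") == "true",
    }

row = clean_event('{"dateint": "20130910", "hour": "22", '
                  '"other_properties": {"CUSTOMER_ID": "42", "IS_TESTER": "false"}}')
```

Casting `dateint` and `hour` to integers up front is what lets the later Hive predicates compare them numerically.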
Hive, like Pig, makes it easy to write code that executes within the Hadoop framework
The language resembles SQL
We use it for ad hoc queries and for creating aggregate/summary tables
To review once again: Cassandra and Honu data land in S3
For Cassandra, we have an intermediary data extraction step before S3
On the right, we have hive, pig and python being used to process and aggregate data for reporting
We then move that down to Teradata and then into MicroStrategy for reporting
Lots of steps. What if we want to just explore the data, skipping the ETL process?
Spin up a hive server in EMR
Configure MicroStrategy to talk to it
Query data directly out of AWS
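As a rough sketch of the MicroStrategy-to-Hive hookup, the BI side typically points an ODBC DSN at the Hive server on the EMR master node. The driver path, host, and key names below are illustrative assumptions, not our exact configuration:

```ini
[EMR_Hive]
Description = Hive on EMR for MicroStrategy (illustrative DSN)
Driver      = /usr/lib/hive-odbc/libhiveodbc.so   ; path is an assumption
Host        = ec2-xx-xx-xx-xx.compute-1.amazonaws.com
Port        = 10000                               ; Hive's default Thrift port
```

MicroStrategy then treats this DSN like any other warehouse connection, so queries run directly against the data in AWS.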