Albert Wong, Manager of DSE Reporting Platform at Netflix
Today talk about how we’ve unlocked big data at Netflix
First, give overview of our overall data architecture, how it’s evolved in recent years
You’ll get an idea of the big data technologies we use and why
Then, we’ll dive into MicroStrategy
Our experience in connecting it to big data
Talk about how we’ve fostered an environment that allows for quick iterative development to keep up with our rapidly changing business
At the end of the session, hopefully you’ll walk away with ideas you can use in your own environment
Let’s start with our data science platform
These are technologies we’ve used for a while
Oracle was the backend database for our site
Abinitio was our ETL tool
Teradata our data warehouse
MicroStrategy for reporting
These are all best of breed tools that we’ve used for a while and still use today …
With our growing customer base
And appetite for international expansion
We were due for a massive hardware upgrade
And so, we shifted our tech infrastructure to the Amazon cloud
On data science side, we needed something that could handle the increased data volumes
Hadoop great fit, b/c
Framework for processing large data sets
Designed to scale
If need to process more data
Just add more servers
Then get more computing power
Hive is a layer on top of hadoop which makes analyzing large datasets easier
You can submit queries similar to SQL against large datasets and expect results returned in a structured format
We use it to create summaries which are pushed down to our data warehouse Teradata
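A toy sketch of what that summary step looks like, using sqlite3 as a stand-in for Hive (the table and column names here are illustrative, not our actual schema):

```python
import sqlite3

# Stand-in for Hive: aggregate raw playback events into a daily summary.
# In practice the query runs in Hive and the result lands in Teradata.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE play_events (event_date TEXT, title_id INTEGER, account_id INTEGER);
    INSERT INTO play_events VALUES
        ('2013-01-01', 1, 100),
        ('2013-01-01', 1, 101),
        ('2013-01-01', 2, 100);
""")

# The summary query: one row per date/title instead of one row per event
summary = conn.execute("""
    SELECT event_date, title_id, COUNT(*) AS plays
    FROM play_events
    GROUP BY event_date, title_id
    ORDER BY title_id
""").fetchall()
print(summary)  # [('2013-01-01', 1, 2), ('2013-01-01', 2, 1)]
```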
Honu is our event data pipeline
A branch of an open source technology called Chukwa that came out of Yahoo; we've made a few modifications tailored for Netflix
Information like play requests, movies displayed to user screens, and searches all flow through this pipeline
It can handle billions of these types of requests daily
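To make the pipeline idea concrete, here is a minimal sketch of one stage of such a pipeline: grouping events into fixed-size batches before shipping them to storage. The field names and batching interface are illustrative, not Honu's actual API:

```python
import json

# Toy pipeline stage: serialize events and group them into fixed-size
# batches for bulk upload (event fields are made up for illustration)
def batch_events(events, batch_size):
    batch = []
    for event in events:
        batch.append(json.dumps(event))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush any leftover partial batch

events = [{"type": "play_request", "title_id": i} for i in range(5)]
batches = list(batch_events(events, batch_size=2))
print(len(batches))  # 3
```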
With that, we needed some place to store all this data
S3 – essentially an infinitely scalable data store in the cloud; it serves as our data warehouse's source of truth
Cassandra provided the backbone for getting from the data center to the cloud and expanding internationally; Cassandra is our Oracle replacement
It’s distributed, syncs across clusters in different data centers
Extremely fast at reads and writes
This is where our dimension data comes from
Python Programming language
We’ve built libraries which allow us to integrate some of our technologies
Specifically in this architecture, we use it to define business rules in our ETL
Pig, another interface which makes it easy to process data through hadoop. We use it to extract and transform data in S3
R, an open source stats package
We use it for advanced analytics and predictive modeling
One way in which we use it is to create algorithms, formulas for recommending movies or shows based on past behavior
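As a heavily simplified illustration of that idea (not our actual algorithm, and in Python rather than R), a toy recommender can score titles by how often they are co-viewed:

```python
from collections import defaultdict
from itertools import combinations

# Toy item-to-item recommendation from co-viewing counts; titles and
# viewing histories are made up for illustration
def co_view_counts(histories):
    counts = defaultdict(int)
    for titles in histories:
        for a, b in combinations(sorted(set(titles)), 2):
            counts[(a, b)] += 1  # count the pair in both directions
            counts[(b, a)] += 1
    return counts

def recommend(title, counts, k=2):
    # rank other titles by how often they were co-viewed with `title`
    scores = {b: n for (a, b), n in counts.items() if a == title}
    return sorted(scores, key=scores.get, reverse=True)[:k]

histories = [
    ["House of Cards", "Arrested Development", "Lilyhammer"],
    ["House of Cards", "Arrested Development"],
]
counts = co_view_counts(histories)
print(recommend("House of Cards", counts))
# ['Arrested Development', 'Lilyhammer']
```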
Let’s review those technologies in the overall architecture
We have cassandra which is the new backbone for the website
We have pig and python which extract Cassandra logs and transform that information into something structured for analytics
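A toy sketch of that extract-and-transform step, assuming a hypothetical tab-delimited log format (not our actual log layout):

```python
import json

# Hypothetical raw log line -> structured record, the kind of transform
# Pig/Python performs in the pipeline; field names are illustrative
def parse_log_line(line):
    timestamp, event_type, payload = line.split("\t", 2)
    record = json.loads(payload)          # structured payload
    record.update({"ts": timestamp, "event": event_type})
    return record

raw = '2012-10-15T08:00:00\tplay_request\t{"title_id": 70178217}'
print(parse_log_line(raw))
# {'title_id': 70178217, 'ts': '2012-10-15T08:00:00', 'event': 'play_request'}
```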
R for advanced analytics
Lot to digest, but if you’ve been following along, there is an arrow that should be on the picture
Can anyone guess what that is?
*Hint: goes from cloud to data center
One thing we experimented with was getting MSTR connected directly with hive
A standard way for 3rd party tools to connect to hive was to set up a hive server
Communication to the hive server is through a thrift protocol
We initially used an ODBC driver that implemented the protocol
We then switched to a native thrift connector that MicroStrategy had begun to support, and it was easier to work with. Error messages were much easier to debug
One interesting finding:
MicroStrategy would sometimes generate queries that were invalid for Hive
Its SQL engine generated valid SQL syntax
But though Hive QL is a language based on SQL, it does not support all SQL constructs
So we worked with MicroStrategy to patch its SQL engine to fix those types of queries
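As a hypothetical illustration of the kind of rewrite involved: Hive at the time did not support IN (...) subqueries in the WHERE clause, so such queries had to be expressed as a LEFT SEMI JOIN instead. The table and column names below are made up:

```python
# Valid SQL that old Hive versions could not run: an IN subquery
generated_sql = """
SELECT a.title_id, a.plays
FROM play_summary a
WHERE a.title_id IN (SELECT title_id FROM top_titles)
"""

# The Hive-compatible rewrite of the same query using LEFT SEMI JOIN
hive_compatible = """
SELECT a.title_id, a.plays
FROM play_summary a
LEFT SEMI JOIN top_titles t ON (a.title_id = t.title_id)
"""
```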
Queries like the one above were re-written
After collaborating with MSTR for about half a year on finding and resolving these issues, we felt the connector was stable enough to use in reporting
Through our testing, we had a sense of how long we would have to sit and wait for queries to return
So adhoc reporting was out of the question
We knew our reports would have to be either cached or emailed
Well, time came for us to expand our streaming services to Norway, Sweden, Finland, and Denmark and everyone in the company wanted to see signup numbers by the hour
So, we thought, this would be a good opportunity to put the hive connector to a real world test
Well, we did, and above is an example of a graph we used in our hourly emailed document
Developing and generating one of those graphs was a painful process and we had several in this document
If you made a mistake in generating your dataset, it was possible to spend the next hour waiting for your report to finish executing
If you were planning to iterate on developing your report, a good portion of your day could be spent this way
This was by far the most painful part, and once the hourly reports went live, tweaks were extremely difficult to make on the fly before the next run the following hour
If I had to do it over again, it would be worth the investment to create a hive summary job and push that data into our data warehouse
For ad hoc hive querying, another group in data science and engineering created a lightweight web reporting tool that lets you submit hive queries and get back basic visualizations
Rounds out our overall architecture
Next we'll dive into our MSTR setup. I'll touch on some of the key MicroStrategy features that have worked well for us
How we’ve supplemented MSTR to fill in gaps on missing features
And talk about how we unlocked our reporting environment for flexibility and speed of development
We've just touched on thrift and how it has allowed us to plug directly into our data in the cloud
So I’ll start with multi-source
Multi-source gives us the option to connect to multiple data sources (Oracle, Teradata, and now Hive) from the same project.
It provides a convenient way of querying data from different sources and joining the data on shared dimensions.
We currently use it to query SQL Server, which keeps track of report statistics, and Teradata, which keeps track of query statistics, and bring both sets of data together into a single dashboard without an ETL step. This has allowed us to more easily identify reports in need of tuning, and ultimately take some of the load off our data warehouse
Intelligent cubes are structures that hold data in RAM and provide speedy access times for our users interacting with cube-based reports or dashboards.
How it works is, you load up a set of data onto the Intelligence server and then build a report or dashboard on top of that
When that report or dashboard is accessed, it bypasses the step of querying our data warehouse and immediately fetches data directly from server memory.
You can think of it as a report cache, with better scaling. We built a dashboard off a 300 MB report cache and later off of a cube and the dashboard running on top of the cube was significantly faster.
Selectors were more responsive
Providing an overall better experience
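A toy illustration of the cube idea in Python: pay the cost of the warehouse query once at publish time, then serve dashboard requests from memory. The class, query, and result names are made up:

```python
import time

# Toy in-memory "cube": one slow warehouse query at load time, then
# every dashboard fetch is served from RAM instead of the warehouse
class Cube:
    def __init__(self, loader):
        self._data = loader()  # slow query runs once, at publish time

    def fetch(self, key):
        return self._data.get(key)  # served from memory thereafter

def slow_warehouse_query():
    time.sleep(0.1)  # stand-in for a long-running warehouse query
    return {"signups_by_hour": [12, 45, 30]}

cube = Cube(slow_warehouse_query)     # query cost paid here, once
print(cube.fetch("signups_by_hour"))  # [12, 45, 30]
```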
Our reporting environment consists of one project and a team of 50 developers covering all business subject areas.
At Netflix we are about creating an open environment
so we designed our reporting environment accordingly
Early on, Netflix pioneered a read only architect license feature with MSTR
Did this so architects could read schema without blocking others
Eventually we converted to full-fledged architect licenses for all developers
Having all these architects led to lots of metadata corruption issues initially, but we worked through them with MicroStrategy
Developers split up their work by focusing on their own subject areas (for example, some specialize in movie information and others in streaming experience, and they coordinate whenever there is cross-over)
In addition, we have a weekly MSTR forum to discuss best practices
we have a semi-formal schema development calendar as a way of preventing contention
One thing our developers enjoy is our twice a day migration schedule, we use a shortcut system like this …
Once you drop in your shortcut, you are done
And we pick up the process of productionalizing development work from there
There is no process to go through
We do not check developer work
There is no qa environment
Changes are seen immediately
This provides for quick iterative development and also allows us to easily scale
Whether we have more developers or fewer in our system, our process is the same: we manage one project and it is self-serve