Our company built a system that mixes Big Data technologies (Hadoop/Elasticsearch) with SQL Server to be both highly scalable and cost effective. In this session I will discuss the reasons we went this route, the pros and cons of going down this path, and what we learned moving from Hive to Spark. We have been running our platform in the Big Data space for 2+ years and have lots of failures to share with others.
2. About Me
• 12+ years working with data
• 4 failed DW attempts
• 2 failed Data architectures
• Avid Volunteer
Blog: www.Sqlasylum.com
Twitter: @SQLAsylum
LinkedIn: https://www.linkedin.com/in/patwright
Email: Sqlasylum@gmail.com
Speaker Rate Link:
3. What is Big Data?
• What is #BigData?
• Volume
• Variety
• Velocity
• Value
• Old idea and concept, new tools.
“Big data is like teen sex. Everybody is talking about
it, everyone thinks everyone else is doing it, so
everyone claims they are doing it.”
--Dan Ariely
4. Why are we talking about Big Data?
• Right Tool for the Job!
• DW took too long
• Cost too much
• Was not flexible
• Couldn’t scale without lots more $$$$$
5. • VOCI – Voice of the Customer Intelligence
• Manage and improve the customer experience to retain more customers
• SaaS-based; must have a repeatable process for all customers.
Problem
• Slow site performance when querying data
• Not Scalable
• Repeatable except for large scale customers
• Dynamic SQL generating all the filtering/SQL statements
• Querying from the read/write OLTP store
10. Lessons Learned… why you really came to this session
• Spark is Awesome… sort of
• HBase is less than awesome.
• Scala is just a teeny bit faster than Python… according to Netflix
• Scala developers are hard to find.
• Memory is crucial for Spark
• Versions suck!
• Kafka is pretty awesome.
• This just in! Versions suck (Ambari 2.4 is your best friend)
Great Explanation of Map Reduce
http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/
Dan Ariely quote taken from:
http://medcitynews.com/2013/11/big-data-like-teen-sex-memorable-quotes-digital-health-innovation-summit/
Various data coming in.
Normalized DB, lots of tables/procedures.
Read/write all in one place.
Steps to make it faster: lots of indexes to support the reporting platform.
Cons to replication:
• Overhead with replication
• Maintenance aspect with monthly releases
• Changes to production systems that would be needed (tables without PK)
• Dependency on SQL Server and licensing costs
Cons to cubes:
• Cost
• Scalability
• Time to process/delay/hardware costs