Big Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with Success
Watch this presentation by Andrei Yurkevich, Altoros's President and CTO, to learn the main challenges that cause big data projects to fail. It presents a strategy that can help you mitigate risks when planning a large-scale, long-term project, with examples of the mistakes Altoros made and of how all the issues were overcome with a prototype.

See more at http://blog.altoros.com/big-data-analytics-2013-in-london.html

Published in Technology
  • Volume, Velocity, Variety. Where to start?
  • Everything seemed to go smoothly. However, there was one detail about MySQL Cluster: its architecture requires keeping all data in RAM, so we needed a cluster with 2.5 TB of RAM. The actual deployment cost was about $500 per day over the budget, so we had to start from scratch again.
  • HBase was 2 seconds faster than Cassandra, but what about fault tolerance? HBase has an additional node that serves as a coordinator for the entire system. If it fails, the system fails. We could add a secondary management node, but then we might exceed the budget. Cassandra has a decentralized architecture: all nodes of its cluster have equal roles, and every node can serve as a coordinator. This makes the database extremely fault tolerant.
  • Raw data is all data that comes from sensors. Processed data is the data aggregated over each 10-minute interval; it is used for building reports.
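The step from raw to processed data described in the last note can be sketched as a 10-minute aggregation. This is only an illustration: the tuple layout `(sensor_id, timestamp, value)`, the function names, and the choice of an average as the aggregate are assumptions, since the slides do not say how the data was aggregated.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bucket_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 10-minute window."""
    return ts - timedelta(
        minutes=ts.minute % 10, seconds=ts.second, microseconds=ts.microsecond
    )


def aggregate(readings):
    """Turn (sensor_id, timestamp, value) tuples into per-window averages.

    The average is an assumed aggregate; a real system might store
    min/max/sum/count instead.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for sensor_id, ts, value in readings:
        key = (sensor_id, bucket_start(ts))
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical raw readings from one sensor.
raw = [
    ("s1", datetime(2013, 1, 1, 12, 1), 10.0),
    ("s1", datetime(2013, 1, 1, 12, 9), 20.0),
    ("s1", datetime(2013, 1, 1, 12, 10), 30.0),
]
processed = aggregate(raw)
```

Reports would then read from `processed`, which holds one value per sensor and window, rather than scanning the raw readings.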

Transcript

  • 1. © ALTOROS Systems | CONFIDENTIAL Andrei Yurkevich Chief Technology Officer andrei.yurkevich@altoros.com
  • 2.
      • Hadoop/NoSQL performance engineering
      • Cluster automation and server templates on Joyent, AWS, SoftLayer, Rackspace, CloudStack and OpenStack using Chef/Puppet, RightScale and SCALR
      • 300+ employees globally (UK, USA, Denmark, Switzerland, Norway, Belarus, Argentina)
      • Featured customers
      • Partners
  • 5. 5^6 combinations
  • 6. 5^6 combinations = 15625
  • 8.
      • No clear business goals
      • Big amounts of data from many sources
      • Architecture design
      • The variety of tools
      • Compatibility of technologies/platforms
      • Lack of professionals
      • All features in one release
      • Budget
  • 10.
      | Functional requirements | Value |
      |---|---|
      | The amount of data added daily | 2.5 TB |
      | Data type | raw data; processed data |
      | Data storage time: raw data | min a week |
      | Data storage time: processed data | min a year |
      | Response time for building reports based on a pre-set template | < 30 sec |
      | Response time for building reports for a custom period of time | < 6 hours |
      | Uptime | 99% |
      | Fault-tolerance | required |
      | Deployment cost per day | < $1,000 |

      Non-functional requirements:
      • Infrastructure-independent architecture
      • Scalability
      • Open-source tools
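The storage requirements on slide 10 imply a minimum raw-data capacity that can be worked out directly. The sketch below does that arithmetic; the replication factor of 3 is an assumption for illustration and does not come from the slides.

```python
DAILY_RAW_TB = 2.5        # from slide 10: data added daily
RAW_RETENTION_DAYS = 7    # from slide 10: raw data kept for at least a week
REPLICATION_FACTOR = 3    # assumption, not stated in the presentation


def raw_storage_tb(daily_tb, retention_days, replication_factor=1):
    """Capacity needed to hold raw data for the retention period."""
    return daily_tb * retention_days * replication_factor


unreplicated_tb = raw_storage_tb(DAILY_RAW_TB, RAW_RETENTION_DAYS)
replicated_tb = raw_storage_tb(DAILY_RAW_TB, RAW_RETENTION_DAYS, REPLICATION_FACTOR)

print(f"Raw data, single copy: {unreplicated_tb} TB")
print(f"Raw data, {REPLICATION_FACTOR} copies: {replicated_tb} TB")
```

A single copy of one week of raw data already needs 17.5 TB, before any replication or processed data is counted.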
  • 11.
      | Feature | Amazon AWS | Joyent | Rackspace |
      |---|---|---|---|
      | Types of a contract | On Demand, Reserved, Spot | On Demand, Reserved | On Demand |
      | Types of instances (classified by compute units) | General Purpose; Compute optimized; Memory optimized; Storage optimized | Standard; High Memory; High CPU; High Storage; High I/O | General Purpose |
      | Storage options | EBS; S3 | Low-cost storage; Network storage based on ZFS | Cloud Block Storage; Cloud Files |
      | Operating systems | Linux, Windows | SmartOS, Linux, Windows | Linux, Windows |
      | A management console | AWS Console | Joyent SmartDataCenter | Cloud Control Panel |
      | A Cloud API | Command line interface; Java, .NET, Ruby SDK and API | Command line interface (CLI); Node.js SDK; REST API | REST API |
      | Regions | America, Europe, Asia, Australia | North America, Europe | America, Europe, Asia, Australia |
      | Estimated cost per month | $18,300 | $17,500 | $21,350 |
  • 12. The feature comparison of Amazon AWS, Joyent and Rackspace from slide 11, repeated with a legend (a good fit, a normal fit, a bad fit), the labels "Option 2" and "Option 1", and a final row: Score 1.5 3.5
  • 13.
      | Features | HBase | Cassandra | MongoDB | MySQL Cluster |
      |---|---|---|---|---|
      | License | Apache | Apache | AGPL | GPL |
      | Protocol | HTTP/REST (also Thrift) | Thrift and custom binary CQL3 | Custom, binary (BSON) | JDBC, ODBC |
      | Data model | Column family | Column family | JSON documents | Tables |
      | Queries / Query Language | JRuby-based (JIRB) shell | Cassandra Query Language | JavaScript expressions | SQL |
      | Partitioning Strategy | Ordered Partitioning | Random Partitioning | Sharding by key | Partition by key |
      | Replication between nodes | yes | yes | yes | yes |
      | Replication between data centers | no | yes | no | yes |
      | Capability to store 2.5 TB daily | yes | yes | yes | yes |
      | Implementation Experience | 1+ | 1+ | 2+ | 5+ |
      | Score | 2 | 3 | 2 | 5 |

      Legend: a good fit, a normal fit, a bad fit
  • 14.
      | Features | HBase | Cassandra | MongoDB | MySQL Cluster |
      |---|---|---|---|---|
      | License | Apache | Apache | AGPL | GPL |
      | Protocol | HTTP/REST (also Thrift) | Thrift and custom binary CQL3 | Custom, binary (BSON) | JDBC, ODBC |
      | Data model | Column family | Column family | JSON documents | Tables |
      | Queries / Query Language | JRuby-based (JIRB) shell | Cassandra Query Language | JavaScript expressions | SQL |
      | Partitioning Strategy | Ordered Partitioning | Random Partitioning | Sharding by key | Partition by key |
      | Replication between data centers | no | yes | no | yes |
      | Capability to store 2.5 TB daily | yes | yes | yes | yes |
      | Implementation Experience | 1+ | 1+ | 2+ | 5+ |
      | Deployment cost per day | $450 | $400 | $500 | $1,500 |
      | Score | 2.5 | 4 | 2.5 | 0 |

      Legend: a good fit, a normal fit, a bad fit
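The deployment costs on slide 14 can be compared against the budget from slide 10 with a few lines of code. This is a minimal sketch: the figures come from the slides, while the variable names are illustrative.

```python
BUDGET_PER_DAY_USD = 1000  # slide 10: deployment cost per day must be < $1,000

# Deployment cost per day as listed on slide 14.
cost_per_day_usd = {
    "HBase": 450,
    "Cassandra": 400,
    "MongoDB": 500,
    "MySQL Cluster": 1500,
}

# Candidates that stay under the daily budget.
within_budget = {
    name: cost for name, cost in cost_per_day_usd.items() if cost < BUDGET_PER_DAY_USD
}

# How far each remaining candidate exceeds the budget, per day.
overrun_usd = {
    name: cost - BUDGET_PER_DAY_USD
    for name, cost in cost_per_day_usd.items()
    if cost >= BUDGET_PER_DAY_USD
}

print("Within budget:", sorted(within_budget))
print("Over budget by (USD/day):", overrun_usd)
```

Only MySQL Cluster exceeds the limit, by $500 per day, which matches its score dropping from 5 on slide 13 to 0 on slide 14.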
  • 16.
      | Feature | HBase | Cassandra | MongoDB |
      |---|---|---|---|
      | Replication between data centers | Asynchronous, needs testing | Replicas can span data centers with synchronous replication | Not supported |
      | A cluster admin node | NameNode | Any node | mongos process |
      | Implementation Experience | 1+ | 1+ | 2+ |
      | Time spent on inserting 30 MB of data | 7 sec | 9 sec | 20 sec |
      | Deployment cost per day | $450 | $400 | $500 |
      | Score | 2 | 2.5 | 0 |

      Legend: a good fit, a normal fit, a bad fit
  • 19.
      | A requirement | The prototype features |
      |---|---|
      | Storing of 2.5 TB of daily raw data for a week | Capable |
      | Storing of 1.5 TB of processed data for a year | Capable |
      | Response time for building reports based on a pre-set template | ~25 sec |
      | Response time of less than 6 hours for building a custom report | ~7 hours |
      | Scalability | Good |
      | Infrastructure Independence | Yes |
      | Using open-source tools | For all components |
      | Fault-tolerance | Yes |
      | Deployment cost per day < $1,000 | ~$600 |
  • 20.
      • Properly visualize and test the functionality
      • Detect bottlenecks and change a technology/tool/database before it is implemented in the real system
      • Get a real vision of the final solution
      • Make sure you stick to the budget
  • 21. Andrei Yurkevich, President/CTO, andrei.yurkevich@altoros.com
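The prototype results on slide 19 can be checked against the limits from slide 10 programmatically. The sketch below uses the approximate values reported on the slides; the dictionary keys are hypothetical names chosen for this example.

```python
# Upper limits taken from the requirements on slide 10.
limits = {
    "preset_report_sec": 30,
    "custom_report_hours": 6,
    "deployment_cost_per_day_usd": 1000,
}

# Approximate values reported for the prototype on slide 19.
measured = {
    "preset_report_sec": 25,
    "custom_report_hours": 7,
    "deployment_cost_per_day_usd": 600,
}

# A requirement is met when the measured value is strictly below its limit.
results = {name: measured[name] < limit for name, limit in limits.items()}
unmet = [name for name, ok in results.items() if not ok]

for name, ok in results.items():
    status = "met" if ok else "NOT met"
    print(f"{name}: measured {measured[name]}, limit < {limits[name]} -> {status}")
```

With these figures the only unmet requirement is the custom report response time (about 7 hours against a limit of 6), which is the kind of bottleneck slide 20 suggests a prototype should reveal before the real system is built.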