How it works
The Google Cloud platform is the search engine company's offering for building applications and websites, and for storing and analyzing data. It operates on a 'pay as you use' model and offers a series of services directed toward solving big data problems.
The biggest appeal of using the Google Cloud platform is that it runs on Google's own technology stack, the same one used to run many of the company's high-traffic big data applications (e.g. Mail, Analytics and Maps). This guarantees service levels -- in terms of availability and scalability -- that only a handful of providers can offer, as well as access to leading-edge technology which in some cases is only available at Google.
Getting started
The first thing you need is a Google account. If you've used any Google service at some point, you already have one. You can sign in to your Google account, as well as recover access to it in case you forgot your password, at https://accounts.google.com/ServiceLogin .
If you don't have a Google account, you can sign up for one at https://accounts.google.com/NewAccount . This is what you'll need to sign up:
- Access to your email
- A password and your birthday
- Confirmation of a captcha and agreement to the terms of service
The Google APIs Console is where you'll manage the majority of projects running on the Google Cloud platform. A project is a collection of information about an application, including such things as authentication information, team members' email addresses and the Google APIs the application uses. You can create your own projects, or be added as a viewer or developer to projects created by other Google account holders. In addition, the Google APIs Console is also where you'll manage and view traffic data for a project, as well as administer billing information for any 'pay as you use' service quotas.
Google Cloud platform services
Google Compute Engine.- Is a virtualized server running Linux -- Ubuntu or CentOS -- managed entirely by Google. You get full control of the server's operating system (OS), as with any other virtual server, and you also gain access to most of the features offered by large virtual server providers, including: public IP address support, the ability to start and stop instances at will to fulfill workloads, tools and APIs to automate server administration, as well as 'pay as you use' billing.
Depending on your circumstances and the type of big data application you plan on running, Google Compute Engine offers four different types of virtual servers. The smallest is the n1-standard-1-d configuration, with 1 virtual core and 3.75 GB of memory, and the largest is the n1-standard-8-d configuration, with 8 virtual cores and 30 GB of memory; the two remaining configurations fall in between. Each Google Compute Engine instance type is offered at a different price point -- the more resources, the higher its hourly price. More important, though, is the ability to start and shut down instances at your discretion, which allows you to switch the type of Google Compute Engine instance you use in a very short time, as well as pay only for the server resources you use.
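To give an idea of what this automation looks like, the following sketch lists and starts instances with the google-api-python-client library; the project ID, zone and instance name are placeholders, and application credentials are assumed to be configured already.

from googleapiclient.discovery import build

PROJECT = 'my-project-id'   # hypothetical project ID
ZONE = 'us-central1-a'      # hypothetical zone

compute = build('compute', 'v1')

# List the instances currently defined in the zone.
result = compute.instances().list(project=PROJECT, zone=ZONE).execute()
for instance in result.get('items', []):
    print(instance['name'], instance['status'])

# Start a stopped instance on demand; you only pay while it runs.
compute.instances().start(project=PROJECT, zone=ZONE,
                          instance='worker-1').execute()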
Google Compute Engine also supports multiple storage types. By default, the data used on a Google Compute Engine instance is assumed to be short-lived, and the moment a server -- technically, the virtual machine (VM) -- is stopped, all data is lost. This is called ephemeral disk storage, a predetermined amount of which is assigned depending on the size of an instance. For cases where you wish to keep data for a longer period (i.e. after a VM is stopped), Google Compute Engine also supports persistent disk storage, where data is kept for days or months without the need to pay for a running Google Compute Engine instance; an instance can later be attached to the persisted data. Persistent disk storage requires an additional payment, unlike ephemeral disk storage, which is included in the hourly fee of a Google Compute Engine instance. Finally, Google Compute Engine instances can also work with data from Google Cloud Storage, another service of the Google Cloud platform -- which I'll describe shortly -- that is billed separately, like persistent disk storage.
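As a rough sketch of the persistent disk workflow, the same Python client library can attach an existing persistent disk to a running instance; the disk and instance names below are made up for illustration.

from googleapiclient.discovery import build

PROJECT = 'my-project-id'
ZONE = 'us-central1-a'

compute = build('compute', 'v1')

# The persistent disk exists independently of any instance, so the data on
# it survives after the VM that used it is stopped or deleted.
disk_url = ('https://www.googleapis.com/compute/v1/projects/%s/zones/%s/disks/%s'
            % (PROJECT, ZONE, 'data-disk-1'))
compute.instances().attachDisk(
    project=PROJECT, zone=ZONE, instance='worker-1',
    body={'source': disk_url, 'deviceName': 'data-disk-1'}).execute()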
At the time of this writing, the Google Compute Engine service requires an additional sign-up process. This means you'll need additional approval -- besides having a Google account -- and you'll also need to pay for quotas from day one (i.e. even just to try it out).
Google Cloud Storage.- Is a storage service that allows you to skip the low-level tasks associated with classical storage systems (e.g. relational databases and regular files). It works entirely on the web, which means that any web-enabled application can interact directly with Google Cloud Storage and perform operations on it (i.e. create, read, update and delete data) via standard REST web services.
From the perspective of big data, Google Cloud Storage is very practical for managing large files. With Google Cloud Storage there are no web servers to maintain (for downloads) or FTP servers (for uploads), and there is no notion of a file's actual type or contents. In Google Cloud Storage everything is treated as an object -- essentially just a 'chunk' of data -- that is transferred and retrieved using the web's protocols. And with the ability to store objects from 1 byte up to 5 TB (terabytes) in size, Google Cloud Storage can be a handy service for big data operations.
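As a minimal sketch of these operations, the following snippet uses the google-cloud-storage Python client library, which wraps the service's REST interface; the project ID, bucket name and file names are placeholders.

from google.cloud import storage

client = storage.Client(project='my-project-id')   # hypothetical project ID
bucket = client.bucket('my-big-data-bucket')        # hypothetical bucket

# Upload a local file as an object -- a 'chunk' of data -- in the bucket.
blob = bucket.blob('datasets/sales-2012.csv')
blob.upload_from_filename('sales-2012.csv')

# Download the same object back; no web or FTP server is involved.
blob.download_to_filename('sales-2012-copy.csv')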
For the first Google project you enable with Google Cloud Storage, you get the following free quotas: 5 GB of storage, 25 GB of download bandwidth, 25 GB of upload bandwidth, 30,000 GET/HEAD requests, as well as 3,000 PUT/POST/GET bucket and service requests. In addition, data residing on Google Cloud Storage can also be leveraged by other Google Cloud services, including Google App Engine, the Google BigQuery API and the Google Prediction API.
Google Cloud SQL.- Is a service that allows you to operate a relational database management system (RDBMS) based on the capabilities of MySQL -- one of the most popular open source RDBMSs -- running on Google's infrastructure. As with other Google Cloud services, the biggest plus of using Google Cloud SQL is that you avoid the system administration overhead involved in running an RDBMS yourself.
In many situations, big data applications grow from using a small, manageable RDBMS to a big RDBMS with many growing pains (e.g. space management, resource limits, backup and recovery). So if your big data applications are going to work with an RDBMS rather than a newer data storage technology (i.e. NoSQL), Google Cloud SQL can be a good option. Be aware, though, that the size limit for individual database instances is 10 GB. In addition, Google Cloud SQL is not a drop-in replacement for a MySQL database (e.g. some features that are part of MySQL aren't available on Google Cloud SQL).
Google Cloud SQL is available under two plans with four tiers each. The tiers comprise D1, D2, D4 and D8 instances, whose primary distinguishing feature is 0.5, 1, 2 and 4 GB of RAM per instance, respectively. In addition, proportional amounts of storage and I/O operations are assigned depending on the tier. Each tier is available under either a package plan -- with monthly quotas and a monthly bill -- or a per-use plan -- with per-hour and per-unit quotas and a 'pay as you use' bill. And like other Google Cloud services, Google Cloud SQL is tightly integrated with other services such as Google App Engine.
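Since Google Cloud SQL speaks the MySQL protocol, a standard MySQL client library is enough to work with it. The sketch below uses the mysql-connector-python package; the host address, credentials, database and table names are placeholders for your own instance's settings.

import mysql.connector

conn = mysql.connector.connect(
    host='203.0.113.10',        # hypothetical instance IP address
    user='app_user',            # hypothetical credentials
    password='app_password',
    database='analytics')       # hypothetical database

cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM page_views")   # hypothetical table
print(cursor.fetchone()[0])

cursor.close()
conn.close()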
Google App Engine.- The Google App Engine is a platform for building applications that run on Google's infrastructure. Unlike the prior Google Cloud services, which provide standalone application services (e.g. Google Compute Engine offers a virtual Linux server, Google Cloud Storage offers space to store files, etc.), the Google App Engine is an end-to-end solution for building applications. This means you design and build applications to run on the Google App Engine from the start. Although this increases the learning curve and limits your options for building an application (e.g. you can't install any software you wish, as you could on an OS-level service like Google Compute Engine), there's the upside of not having to worry about issues like application scaling, deployment and system administration.
Since the Google App Engine is a platform, its applications are built around a set of blueprints or APIs. The Google App Engine supports three programming languages: Python, Java and Go -- the last of which is a programming language created by Google. For each of these languages Google provides an SDK (Software Development Kit) with which you design, build and test applications on your local workstation. When you're done, you upload your application to the Google App Engine so end users can access it -- applications built with any of the SDKs are compatible with, and uploaded to, the same Google App Engine.
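To illustrate, here is a minimal Python application for the App Engine SDK, written with the webapp2 framework bundled with it; it is only a sketch, and the handler and route are made up for illustration.

import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        # Respond to HTTP GET requests on '/'.
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('Hello from Google App Engine')

# The WSGI application App Engine serves; the routes map URLs to handlers.
app = webapp2.WSGIApplication([('/', MainPage)], debug=True)

After testing this locally with the SDK's development server, the same code is uploaded unchanged to the Google App Engine.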
The Google App Engine also supports a series of storage mechanisms for an application's data. The default Google App Engine datastore provides a schemaless NoSQL object datastore, with a query engine and atomic transactions. The Google Cloud SQL service is another alternative, providing a relational SQL database based on the MySQL RDBMS. In addition, the Google Cloud Storage service is also available, providing storage for objects and files up to 5 terabytes in size.
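As a small sketch of the default datastore, the Python SDK's ndb library lets you define, store and query entities; the Visit model and its properties below are invented for illustration.

from google.appengine.ext import ndb

class Visit(ndb.Model):
    # Even a schemaless datastore lets you declare typed properties per model.
    page = ndb.StringProperty()
    timestamp = ndb.DateTimeProperty(auto_now_add=True)

# Store an entity, then query it back with the datastore's query engine.
Visit(page='/index').put()
recent = Visit.query(Visit.page == '/index').fetch(10)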
Unlike other Google Cloud services managed on the Google APIs Console, the Google App Engine has its own administrative console, available at https://appengine.google.com . The Google App Engine is also available in three price tiers: free, paid and premiere. Each of the three tiers has free daily quotas, which include 28 instance hours, 1 GB of outgoing and incoming bandwidth, 1 GB of App Engine datastore storage, as well as other resources such as I/O operations and outgoing emails. For all tiers, these quotas are reset daily. The biggest difference between the free tier and the other two is that if an application on the free tier consumes its daily quotas, you're not allowed to buy additional quota.
On the free tier, if an application consumes its daily Google App Engine resources, it simply stops or throws an error (e.g. 'Resource not available'). This means that if you're expecting a considerable amount of traffic, you should consider the paid or premiere tiers, both of which allow you to purchase additional 'pay as you use' quotas. The paid tier carries a minimum spend -- at the time of this writing, $2.10/week -- toward 'pay as you use' quotas, which means that whether or not your application consumes its daily quotas, you'll be charged a minimum of roughly $9 per month per application. The premiere tier is designed for cases in which you plan to deploy multiple applications and carries a charge of $500 per month per account. The difference between the paid and premiere tiers is that the premiere tier is billed per account -- covering any number of applications -- and also includes operational support from Google, which is not included in the other two tiers.
Google BigQuery Service.- Is a service that allows you to analyze large amounts of data, into the terabyte range. It's essentially an analytics service that can execute SQL-like queries against datasets with billions of rows. BigQuery works with datasets and tables, where a dataset is a collection of one or more tables and a table is a standard two-dimensional data table. Queries on datasets and tables are run from either a browser or a command-line tool, similar to other Google Cloud storage technologies.
Though BigQuery sounds similar to an RDBMS or a service like Google Cloud SQL, since it also uses SQL-like queries and operates on two-dimensional data tables, it's different. The primary purpose of BigQuery is to analyze big data, so it's not well suited to constantly saving or updating data, as is typically done on the RDBMSs that back most web applications. BigQuery is intended as a 'store once' system that's consulted over and over again to obtain insights from big data sets -- similar to the way data warehouses or data marts operate.
BigQuery offers free monthly quotas for the first 100 GB of data processing. And since BigQuery uses a columnar data structure, for a given query you're only charged for the data processed in the columns it touches, not the entire table -- meaning 100 GB can go a long way toward running queries. In addition, BigQuery can interact with data residing on Google Cloud Storage, and it also offers a series of sample data tables which can serve for analytics on certain big data sets or be used to test out the service. The sample data includes: samples from US weather stations since 1929, measurement data on broadband connection performance, birth information for the United States from 1969 to 2008, a word index for the works of Shakespeare, and revision information for Wikipedia articles.
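As a sketch of how such a query looks from Python, the following snippet uses the google-cloud-bigquery client library against the public Shakespeare sample table; the project ID is a placeholder, and the library choice is an assumption rather than something the service requires.

from google.cloud import bigquery

client = bigquery.Client(project='my-project-id')   # hypothetical project ID

# A SQL-like aggregation over the sample table; only the two columns the
# query touches count toward the data-processing quota.
sql = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(sql):   # runs the query job and iterates its rows
    print(row.word, row.total)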
Prediction API.- Is a service that allows you to predict behaviors from data sets using either a regression model or a categorical model. The Prediction API uses pattern matching and machine learning under the hood, so you can avoid the programming required to implement regression or categorical models yourself. Like any other prediction tool, the greater the sample data -- or training data, as it's called in the Prediction API -- the greater the accuracy of a prediction.
The Prediction API always requires you to provide it with training data -- numbers or strings -- so it can answer the queries you want to predict. For example, when running a regression model, the Prediction API compares a given query to the training data and predicts a value based on the closeness of existing examples (e.g. for a data set containing commute times for multiple routes, predict the time for a new route). When running a categorical model, it determines the closest category fit for a given query among all the existing examples provided in the training data (e.g. for a data set containing emails labeled as spam and non-spam, predict whether a new email is spam or non-spam). Regression models in the Prediction API return numeric values as the result, whereas categorical models return string values.
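The sketch below shows what a prediction call might look like with the google-api-python-client library, assuming a categorical model has already been trained; the model ID, project ID and input values are placeholders, and the v1.6 version string reflects the API at the time of writing.

from googleapiclient.discovery import build

service = build('prediction', 'v1.6')

# Ask the trained model to classify a new example; the csvInstance fields
# must line up with the columns of the training data.
body = {'input': {'csvInstance': ['free money now', 'unknown-sender']}}
result = service.trainedmodels().predict(
    project='my-project-id', id='spam-classifier', body=body).execute()

# A categorical model returns a string label; a regression model would
# return a numeric value instead.
print(result.get('outputLabel') or result.get('outputValue'))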
The Prediction API can interact with data residing on Google Cloud Storage and is also free for the first six months, up to 20,000 predictions. In addition, the free period is limited to the following daily quotas: 100 predictions per day and 5 MB of training data per day. For usage of the Prediction API beyond the free six-month period or the free daily quotas, 'pay as you use' quotas apply.
Translate API.- Is a tool to translate text between, or detect the language of, over 60 different languages. The API is usable either through standard REST services or via REST from JavaScript (i.e. directly from a web page). Language translation and detection quotas are calculated in millions of characters. At the time of this writing, the price for translating or detecting 1 million characters is $20, with charges prorated to the number of characters actually provided (i.e. spaces aren't counted). By default, there's a processing limit of 2 million characters a day, but this limit is adjustable up to 50 million characters a day from the Google APIs console.
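As a small sketch of the REST interface, the snippet below calls the Translate API's v2 endpoint with the Python requests library; the API key is a placeholder obtained from the Google APIs console, and billing is assumed to be enabled on the project.

import requests

API_KEY = 'YOUR_API_KEY'   # hypothetical API key
url = 'https://www.googleapis.com/language/translate/v2'

# Translate a short string into English; omitting 'source' lets the
# service detect the original language.
params = {'key': API_KEY, 'q': 'Hola mundo', 'target': 'en'}
response = requests.get(url, params=params)

data = response.json()['data']['translations'][0]
print(data['translatedText'], data.get('detectedSourceLanguage'))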