Re:Invent is Amazon Web Services’ annual event held in Las Vegas, USA. Amazon keeps selling out the event, with the 2017 edition already spanning five hotels (The Venetian, Mirage, MGM Grand, Aria and Encore). Amazon typically uses the event to announce new products hosted on its cloud platform, and we had high expectations for their machine learning lineup.
I’m summarising my personal notes on what I feel are the important takeaways for data engineers/machine learning engineers/data scientists from the past event week. I’m assuming you have prior experience with the Amazon cloud and are somewhat familiar with its broad component ecosystem and acronyms.
The TL;DR summary:
- Serverless continues trending upwards. Serverless microservices are here to stay, and AWS extends its offering with serverless container services and a serverless relational database.
- End-to-end development processes are moving to the cloud. Amazon supports data scientists and data engineers with native options for quality development environments, tightly integrated with code repositories and devops practices.
- Amazon offers managed deep learning frameworks such as TensorFlow and MXNet as native components of a development pipeline, with automated hyperparameter tuning and managed deployments out of the box.
1. CODE DEVELOPMENT: THE RETURN OF CLOUD9.
Cloud9.io was quite a revolutionary coding platform back in its day. I appreciated its virtualized coding environment as much as its clean and straightforward UI. Personally, I felt let down after the announcement that Amazon was acquiring them in July 2016, and the subsequent product takedown.
Much to my surprise, Re:Invent brought us the resurrection of Cloud9.io, rebranded as AWS Cloud9. It appears to be deeply integrated with Amazon EC2, AWS CloudFormation and Identity and Access Management. What’s great here is that the former sandbox environment that came with Cloud9 is now a fully managed EC2 instance, which gives us full control and the flexibility to install additional packages. It comes with out-of-the-box support for programming languages we use and love, such as Python, Java and Go.
Integration with serverless
Furthermore, Cloud9 is tightly integrated with serverless microservices through AWS Lambda. The IDE is capable of writing, running and debugging AWS Lambda functions from within the browser. Engineers push locally developed code to a live environment through a continuous integration pipeline. The current release supports AWS’s own, relatively new CodeStar as a CI/CD platform. Being more of an Atlassian/Semaphore person myself, I am looking forward to broader integration in the near future.
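To give a feel for what this looks like in practice, here is a minimal sketch of the kind of Lambda handler you would write and debug from within Cloud9. The function name and the API Gateway-style event shape are my own assumptions, not part of the announcement.

```python
# Minimal AWS Lambda handler of the kind you can write and debug in Cloud9.
# The event shape (an API Gateway proxy event) is assumed for illustration.
import json


def handler(event, context):
    """Echo back the request body with a simple response envelope."""
    body = json.loads(event.get("body") or "{}")
    return {
        "statusCode": 200,
        "body": json.dumps({"received": body, "message": "Hello from Lambda"}),
    }
```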
Personally, I welcome this release. I think Cloud9 is a strong workbench for at least some aspects of data engineering, one that comes with the conveniences of data science notebooks such as Jupyter.
2. COMPUTING: THE “SERVERLESS” TREND CONTINUES.
Container orchestration has always been kind of a hassle on AWS. Sure, we had Elastic Container Service running Docker, but the Kubernetes platform – initially a project by Google – came with considerable setup and maintenance work to run yourself.
In a move to catch up with competition, Amazon announced native Kubernetes support, branded as Amazon Elastic Container Service for Kubernetes (EKS). The module is currently in preview. It basically automates the installation, upgrading and high-availability aspects of running an orchestration platform. Furthermore, the platform is built upon open-source, upstream Kubernetes so we can safely use existing plugins and tools borrowed from other cloud platforms.
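Since EKS runs upstream Kubernetes, existing tooling should keep working unchanged. As a minimal sketch, assuming a kubeconfig that already points at the EKS cluster, the official Kubernetes Python client can list pods as usual:

```python
# Minimal sketch: because EKS runs upstream Kubernetes, the official Python
# client works as-is. Assumes your kubeconfig already points at the cluster.
from kubernetes import client, config

config.load_kube_config()  # reads ~/.kube/config
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```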
Additionally, AWS announced serverless container support, branded as AWS Fargate. The technology abstracts away the provisioning, maintenance and scaling of the EC2-based architecture currently underpinning Elastic Container Service. This allows for per-second billing during peak loads. The announcement promises a technology that scales out to thousands of containers in a matter of seconds. Fargate is currently available for ECS workloads, with support for Kubernetes (through EKS) coming in 2018.
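To illustrate what running containers without managing instances looks like, here is a sketch of launching a task on Fargate with boto3. The cluster name, task definition and subnet ID are placeholders I made up:

```python
# Sketch of launching an existing ECS task definition on Fargate via boto3.
# Cluster name, task definition and subnet ID are hypothetical placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

response = ecs.run_task(
    cluster="my-cluster",            # hypothetical cluster
    taskDefinition="my-task:1",      # hypothetical task definition
    launchType="FARGATE",            # no EC2 instances to manage
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
)
print(response["tasks"][0]["taskArn"])
```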
Hybrid cloud and multicloud became a whole lot easier overnight. I sure appreciate the freedom for continuous experimentation this offers. We save considerable time and offset risk by having this as a service.
3. DATABASES: INTRODUCING NATIVE GRAPHS AND SERVERLESS RDBMS.
Now here’s a surprise. The announcement of Aurora Serverless (currently in preview) promises a serverless relational database, currently based on MySQL. The database will automatically start up, shut down, and scale capacity up or down based on the application’s needs. In contrast to their standard MySQL offering on Relational Database Service, the setup, maintenance and scaling of EC2 instances disappears. Interesting uses for this announcement could be low-volume blogs, test and acceptance environments and new applications.
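Because Aurora Serverless speaks the MySQL protocol, connecting should be no different from any other MySQL database. A minimal sketch with PyMySQL, where the endpoint and credentials are placeholders:

```python
# Sketch: Aurora Serverless speaks the MySQL wire protocol, so a standard
# client such as PyMySQL connects as usual. Endpoint and credentials are
# placeholders; the cluster scales capacity behind the scenes.
import pymysql

connection = pymysql.connect(
    host="my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",
    user="admin",
    password="change-me",
    db="blog",
)
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT COUNT(*) FROM posts")
        print(cursor.fetchone())
finally:
    connection.close()
```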
Amazon Neptune – currently in preview – extends the platform’s capabilities with a dedicated graph database. Use cases include fraud detection, knowledge graphs, drug discovery and network security, amongst others. The database is accessible through Apache TinkerPop. Now this is a move I personally consider interesting, as we’ve been working with graphs over the past half-year and have gone off-platform to avoid maintenance and setup.
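Since access goes through TinkerPop, the standard gremlinpython driver ought to work against a Neptune endpoint. A sketch with a made-up cluster URL and a fraud-detection-flavoured traversal:

```python
# Sketch: Neptune exposes a TinkerPop-compatible Gremlin endpoint, so the
# gremlinpython driver can traverse the graph. The endpoint URL is a placeholder.
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

connection = DriverRemoteConnection(
    "wss://my-neptune.cluster-abc123.us-east-1.neptune.amazonaws.com:8182/gremlin",
    "g",
)
g = Graph().traversal().withRemote(connection)

# Count 'transfers_to' edges leaving 'account' vertices: the sort of pattern
# a fraud-detection workload might query.
count = g.V().hasLabel("account").outE("transfers_to").count().next()
print(count)

connection.close()
```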
4. END-TO-END MACHINE LEARNING WITH SAGEMAKER.
This might easily be my personal favorite from the event. Building, training and deploying machine learning models has always been a manual endeavour on AWS. Sure, we had off-the-shelf machine images for TensorFlow. Until recently, AWS’s strategy was to push MXNet – the deep learning framework behind the voice-controlled Amazon Alexa. However, we didn’t see a clear path towards integration with their broader platform.
Re:Invent however brings us Amazon SageMaker, a “fully managed service that enables data scientists and developers to quickly and easily build, train, and deploy machine learning models at any scale”. The technology seems to compete with Google’s Cloud ML, though without custom accelerators such as Google’s TPUs or equivalents.
The machine learning platform comes with Jupyter as a front end for quick visualisation and development. It integrates directly with native AWS data stores such as S3, MySQL, PostgreSQL and Redshift through their own Apache Spark-based library.
SageMaker supports automated hyperparameter tuning, with MXNet and TensorFlow supported out of the box and the option to install other frameworks at will. At the infrastructure level, model training scales automatically on NVIDIA GPUs. The module deploys models to a managed, autoscaling cluster of EC2 instances with A/B testing support.
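Based on the SageMaker Python SDK, the train-and-deploy workflow looks roughly like the sketch below. The training script, IAM role, S3 path and instance types are assumptions for illustration:

```python
# Sketch of the SageMaker Python SDK workflow: train an MXNet model and deploy
# it to a managed, autoscaling endpoint. The entry-point script, IAM role,
# S3 path and instance types are assumptions, not taken from the announcement.
from sagemaker.mxnet import MXNet

role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role

estimator = MXNet(
    entry_point="train.py",                # your training script
    role=role,
    train_instance_count=1,
    train_instance_type="ml.p2.xlarge",    # NVIDIA GPU instance
    hyperparameters={"epochs": 10, "learning_rate": 0.01},
)

# Training runs on managed infrastructure; data is read from S3.
estimator.fit("s3://my-bucket/training-data")

# Deploy the trained model behind a managed HTTPS endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")
print(predictor.endpoint)
```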
This is of course quite the announcement, and highly relevant for the work we’re doing, so expect a dedicated blog post in the near future.
5. BONUS: DEEP LEARNING ON THE EDGE.
DeepLens is a standalone, smart wireless video camera marketed as an educational aid for machine learning engineers. The device processes camera input through a deep learning model, using just about all of the technologies discussed above. From what I understood from the talk, the model training phase remains in the cloud, whereas models can be deployed either in the cloud or on the device itself, with input capture for model retraining enabled through Lambda-backed REST APIs.
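As a very rough illustration of that last point, a Lambda-backed capture endpoint could look something like the sketch below. The bucket name, key scheme and base64-encoded request body are my own assumptions, not the DeepLens API:

```python
# Very rough sketch of a Lambda-backed capture endpoint that stores incoming
# frames to S3 for later retraining. Bucket name, key scheme and the
# base64-encoded request body are assumptions for illustration only.
import base64
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-retraining-frames"  # hypothetical bucket


def handler(event, context):
    """Decode an uploaded frame and store it in S3 for the retraining set."""
    frame_bytes = base64.b64decode(event["body"])
    key = f"captures/{uuid.uuid4()}.jpg"
    s3.put_object(Bucket=BUCKET, Key=key, Body=frame_bytes)
    return {"statusCode": 200, "body": key}
```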