#i know my team is not optimal for Kafka
Explore tagged Tumblr posts
xsumire · 1 year ago
Text
Thanks, Kafka. I'm officially converted to Nihility path in Simulated Universe
Simulated Universe - World 3 Level 5
11 notes · View notes
iamacultest · 3 months ago
Text
I wanted to repost with all of the dialogue together, along with some updated voice lines.
Discussing Vasha
Jade: The most priceless of gems come from the harshest of environments. It is my job to pick the gems with potential, watch them grow and learn, and turn them into the best selves they can possibly be. sigh If only destiny had other plans.
Dr. Ratio: The Stellaron Hunters are known by the IPC as a group of ruthless terrorists acting solely for Destiny Slave. However, through my time spent with that gambler, I have come to deduce that such a statement is without a doubt false.
Boothill: That kid knows how to make an impression, I'll tell you that. As mad as they come! That oughta be why we get along.
Sam/Firefly: I will never forgive Kafka for teaching him how to play card games. Or Silver Wolf for introducing him to… What was it called? Oh yea! Those “gacha” games. sigh Every time he just keeps going and going and going, the price keeps going up, and it takes F O R E V E R to stop him. It stresses me out!
Kafka: We Stellaron Hunters all have different wants. Yet despite our differences, destiny tangled us up together~. Forced to be slaves to our desires. In his case, the term “slave” just so happened to be a bit more literal at the time.
Blade: Our “blessings” are the worst curse one could ask for.
Silver Wolf: Kafka has her Spirit Whisper and assassination training. Blade has his curse and sword. Firefly had SAM. But when it comes to me and him? We’re a bit more… like the brains of the operation. The people behind the scenes who aren't Elio.
Black Swan: Gambling with destiny is truly a cocky move. Even more so when destiny casts you out of the table. Perhaps that's exactly what destiny wanted?
Sparkle: Turning down an invitation to the tavern? If it were anyone else, I would have been appalled, but turning down the invitation to join the Stellaron Hunters?! What fun!
Vasha discussing others:
The Stonehearts: For the sake of the Amber Lord, Diamond broke apart his power and bestowed his authority as an Emanator upon 10 individuals. The holders of his divine authority were called the Cornerstones. Wanna wager how I got mine?
Dr. Ratio: I always see people online comparing him to members of the Genius Society. It's hilarious, really. They lose themselves in false statements and lies to entertain themselves. Betting on lies that hold no meaning. Only to lose the truth entirely. The thing that makes Veritas incomplete to Nous makes him perfect for the title of doctor.
Silver Wolf: Life is a game. Somebody has to win eventually~.
Blade: Kafka always says that his fights are fun to watch. Me? I hardly see the appeal. sigh alas everyone has hobbies I suppose…
Kafka: Just like the others, I too was recruited by her. The only difference is the fact she was never supposed to.
SAM: For a machine made for the destruction of the Swarm, he is surprisingly good at roasting food. You’ll have to ask Firefly next time. I’m sure she’ll gladly make you some.
Firefly: Optimism, the true wild card in a game of life or death.
Elio: As long as the future Elio sees can come true, I’m willing to sit at the table.
Sparkle: The Fools never play to win — they're more like crazed thrill seekers. (regular voice line)
Boothill: sigh Look, I don't hate the guy. Hate is a rather strong word for someone you’ve only met twice. I mean, I’m not petty- he’s just loud and talks too much- and not in the endearing way like the Doctor.
Sunday: Taking a shortcut or two isn't a big deal, but blatantly dipping your hands into someone else's pockets — that lacks finesse. (Regular voice line)
Robin: I don't see the appeal of her music, but Firefly seems to like it, and oddly enough, Blade likes it too. If it gets Blade to stop having those massive mood swings of his, then I’ll just have to sit with it, won't I?
Joining a team with others:
SAM/Firefly: Oh? Don't you think this is a bit too easy?
Kafka: A woman who can't feel fear, versus a gambler who ignores the feeling of uncertainty. Who will win this bet?
Blade: Let me guess? “I need no shield for I will never break.” Yea right!
Silver Wolf: Wanna play a game? Your choice~.
Jade/Topaz: Don't worry. Your precious little stone is safe and sound~.
Dr. Ratio: Bet on me, will you? I’ll make it worth your while.
Boothill: A poor man knows exactly what he wants. Why not have two poor men?
The Trailblazer: That Stellaron of yours seems to be in good hands.
Others joining his team:
SAM/Firefly: Can we skip the “bets”?
Kafka: Just your luck~
Blade: That ecstasy can only last for so long. Make it count.
Silver Wolf: Don't get hit too hard. Using all your stamina before the big bad is a total waste.
Jade: Stealing a relic of the Amber Lord's body? A big gamble indeed…
Dr. Ratio: A walking, talking Peacock? The Genius Society ought to come with a spectacle.
Boothill: Well fork me! The Amber Lord oughta be quivering in THEIR grave.
Sparkle: The grand stage is set! Care to bless us with your luck?
8 notes · View notes
prismatic-glow · 11 months ago
Text
hi guys!! for those of you who play, how has the new star rail patch been for you?
I absolutely LOVE penacony and the story was awesome! I think I've said this b4 but I love any stories based around dreams
I also got a super early black swan and won my 50/50! I think it was around 40 pity? I was super happy about that since I really wanted her but didn't think I could grind enough to guarantee her. She literally came on a random single pull I skipped
I'm very rational and like- optimal? in genshin. If you follow me/know my content you know what I mean
but in star rail I kinda do whatever I want. Usually I pull characters based on aesthetics (pretty women = pull) and although I stick to team-building basics I'm far from optimal
I am a proud hook main (built her bc e6, kept playing her bc she's adorable and hits decently hard?) and I was very glad to see black swan kindaaaaaa fits into my team although I don't even have Kafka (yet)
New posts coming soon! I think I'll be making a general guide for like- getting started on improving? Like how to farm for the right artifacts and go from bad stats to an actually impressive build that you're proud of! This is something I have a lot of personal experience with since my builds used to be really bad. I didn't have any good crit ratios, but slowly I started learning how to choose good pieces, then noticing that my characters actually had decent crit ratios, and now I have top 1%, 3%, and 4% characters!
that's all for this little update post, have a wonderful day!
˗ˏˋ statuses ´ˎ˗
upcoming posts ・❥・ gaming guide, build improvement guide, updated kit explanations
asks ・❥・ open!
0 notes
carolynpetit · 6 years ago
Text
Reason to Play, a Journal--Entry One: Fortnite, MGSV, and Finding Ourselves in the Act of Play
Hi. 
This is the first entry in what I hope will be an ongoing journal of play. I wanted to start by explaining my thinking behind this project.
Right now, I’m looking for a reason to play. I’m always wary of games that seem to offer nothing beyond a mildly pleasant occupation of my time, and right now, I find such games downright inadequate. Unworthy. These are horrifying times, and yet, like so many of us, I find myself exhausted by it all. Unable to maintain the levels of rage and resistance that the actions of the current administration demand. I see it all becoming normalized and I feel powerless to stop it. And as the days and weeks and months go by, I feel as if this numbness accrues. I become increasingly detached, not just from the horrors of the moment but from myself. I start to wonder where the person I believed myself to be has gone. 
I believe that art is most vital in times like this. I love this quote from Kafka: 
“I think we ought to read only the kind of books that wound or stab us. If the book we're reading doesn't wake us up with a blow to the head, what are we reading for?...We need books that affect us like a disaster, that grieve us deeply, like the death of someone we loved more than ourselves, like being banished into forests far from everyone, like a suicide. A book must be the axe for the frozen sea within us. That is my belief.”
If a game isn’t going to be the axe for the frozen sea inside me, if it isn’t going to cut through the numbness, shake me up, break my heart, fuck me up, do something to rehumanize me, it is not worthy of this moment. 
But I might find what I’m looking for anywhere. I’m not talking just about games that explicitly comment on fascism or racial injustice or economic inequality. Yes, I think it’s essential that we have art, including games, that confronts these things directly, but it’s also true that a game can have the noblest aims and leave me cold, while a throwaway moment in a big-budget mainstream game of the sort that certain gamers like to call “apolitical” can crack my heart wide open. 
Like most of my writing about games, this journal will be a place where I fully embrace the subjectivity of my own experience with the games that I play.
Okay. Here we go.
I. Testin’ My Mind, Shakin’ My Body in Fortnite
Yeah, okay, Fortnite’s a Battle Royale. That’s just a fact. If you’re playing solo, which I almost always am--I’m uncomfortable teaming up with random players, though on occasion I’ll play duos with a friend, which makes for a completely different, really exciting dynamic--you drop onto the island with close to a hundred other players, and the way you win is by being the last player standing. Now, I encourage conversations about the violence inherent to the format, as well as about all the other aspects of Fortnite that people rightly raise concerns about--the way in which it’s monetized, Epic’s pattern of repeatedly profiting off of dances associated with artists and communities of color without compensating the artists or communities that created them. All of it. But if we’re gonna go to the mat with Fortnite on these aspects (and we should), let’s also at least have a full, multifaceted conversation about why we play Fortnite, how it feels, and the moments that can emerge from a fully invested experience of the game.
Did you know that earlier this year, a massive beast that had been frozen in ice under Polar Peak broke free, that huge footprints showed it had made its way to the sea, where it’s occasionally been spotted, roaming the waters around the island? Did you know that right now, a towering robot is being built in the remnants of the volcano? It seems inevitable that soon, a massive Pacific Rim-style fight between them will take place, almost certainly resulting in a new wave of major changes to the island. Indeed, the island is always a place in flux, changing in big and small ways. It’s alive in ways that I’ve always wanted my game worlds to be alive. Landing near Loot Lake a few weeks ago, I was excited to see that the massive power cable that runs through the area was shredded and sparking, as if perhaps the monster had taken a bite. 
But the life of the environment wouldn’t mean much if it weren’t for my encounters with the lives of other players. The other day, I was trying to complete a challenge that required me to get a certain score on a balloon board at one of the numerous little beach party setups that currently dot the map. Jumping from the bus, I swooped down to a spot in the desert, opened a chest, grabbed the weapon, and made my way over to the nearby board. Another player got there just before me, and I stood still, hoping to indicate that I didn’t want to stop them from completing the challenge. They froze for a moment, but then proceeded, and when they hit the necessary score, a little celebratory explosion of confetti occurred, and I got credit for the challenge, too. 
Basking in the glow of our shared little moment, I wanted to walk away then, wishing them nothing but the best in the match ahead. But then they took a shot at me. In that instant, a sinking feeling ran through my whole body, a physical expression of “Aw, why’d you have to go and do that?” and in an instant, I obliterated them. It wasn’t a victory. It was more like putting someone down. I didn’t feel good about it, but it sure was a real feeling. Something surprising and immediate that emerged from my encounter with another living person. And that’s what I’m here for. 
Yes, Fortnite is a Battle Royale, but so much of the experience of Fortnite is about unexpected occurrences like this, and about the things we do in the stolen moments between the shootouts and build battles. The other day, I got so caught up in playing a silly memory game I stumbled upon that I wound up getting caught in the storm. Not long before that, I danced with John Wick to raise a disco ball in an abandoned lair so we could snag a fortbyte, one of this season’s collectibles. These are the things I really remember, not my win-loss ratio or all the times I’m eliminated by players much better than I am before I quickly hit play and hop on the battle bus all over again.
[Image: the Fortnite character Elmira]
I’m eager to return to the island because the island itself feels vibrant and alive, emanating a kind of Spielbergian Americana and optimism, but also because of the vigorous bodies and exuberant identities I get to inhabit while I’m there. The mix-and-match nature of Fortnite’s customization means that one round I might be a sprightly female wizard with a sleek laptop on her back, and the next a nerdy, purple-haired gamer girl with a satchel full of potions and spellbooks. “Fun” may be overemphasized in some of our conversations around games, but it certainly has its place, and playing as these colorful characters, well, it’s just fun.
Every character in Fortnite plays exactly the same, but they don’t all feel the same to me. I just unlocked a black variant of the character Sentinel, a robot or power suit that looks like it might have appeared on Mighty Morphin’ Power Rangers, and I think it looks kinda cool, but I sure don’t want to be it. On the other hand, playing as Elmira (pictured above) feels good. And oh, do I love the way that some emotes make me feel. Tweeting recently about an emote called the Laid Back Shuffle, I wrote:
I’m almost always pretty uncomfortable in my body, for a number of reasons related to my appearance and my transness and things. The easygoing physical exuberance of this emote, the way that the avatar performing it, whatever avatar that might be at any given moment, appears to feel so loose and free in their own body, makes it really appealing to me, like a virtual experience/expression of a sensation that I’ve never known IRL. I think emotes have some kind of power beyond whatever power we often think of them having, perhaps particularly for those of us who never really feel comfortable in our own skin. 
And all the kids playing Fortnite that we’re so worried about, let’s remember that their experience of this game isn’t as simple as just trying to slaughter everyone else on the island. Setting aside whatever value there may be in the particular type of complex thinking and skill-building that it requires to try to simultaneously outbuild and outgun your opponent, there’s also the fact that they, too, are experiencing the life of Fortnite’s island, having encounters with other players that play out in unexpected ways, and experimenting with self-expression. Yes, their opportunities for that exploration and expression are gated by money, and that’s a real issue, but that doesn’t change the fact that a young person finding that they feel particularly cool when playing as a woman in red with a bionic arm is valid, and maybe even valuable. 
II. MGSV and What I Know Is True
I set The Phantom Pain aside for a few years after hitting a mission that I found maddeningly difficult, but something called me back to it. Now I’ve powered through the mission that gave me so much trouble, and I’m making progress again. I enjoy the geographical roughness of its environments, and the way you really have to deal with that roughness, often lying flat and crawling along the ground. The truth is that I spend far too much time alone in my apartment, and though it’s no substitute at all for the real, natural world, when I take my time being rooted in one spot to scout out locations and tag enemies before making any dangerous moves, I feel the shape of the space around me in a way that I rarely do in games. 
The other day I fought a grueling boss battle and then, finally, when it was over, hopped onto the helicopter to return to base, exhausted by the ordeal. Just as we were about to lift off, Quiet hopped on, hanging off of the side of the chopper as the rotors above her head spun faster until we lurched up and away from the ground. She held my gaze the whole time. I think a lot of games look at the player too much. They want you to feel like the center of the universe, the only person who really matters. But that wasn’t the feeling I got from this moment. I’d just fought for my life, and the way she looked at me, without malice or sympathy for what I’d just been through or anything, made me feel like I was being sized up. Looked at in a real way. Seen.
Do you know that feeling--Does this happen to everyone or just me?--that feeling where, for a moment, your awareness kind of spreads beyond yourself and you’re suddenly very aware that what you’re experiencing is something real that is happening in physical, three-dimensional space at this exact moment in time? It’s a feeling I get sometimes when I’m in a moment that I wish I could make last, or that I really want to remember. Sharing a last drink with a friend before they move away, that sort of thing. This feeling of momentarily being very much rooted in myself but also outside of myself and acknowledging, This is real. This is something that happened. That moment where Quiet was looking at me in the wake of the momentous battle I’d just fought felt something like that. 
It didn’t happen in real, physical space, but virtual space is a valid space, too, a space where real things happen. Sometimes when I’m playing Fortnite I’ll see the hillside where a friend and I once sped away from attackers on a Quadcrasher, bullets whizzing past our heads, and I’ll think, We were there. That happened. These moments become part of my relationship with the ever-changing island, just as my memories of San Francisco become part of my relationship with the city.
On another recent mission, I was sneaking my way through an enemy outpost when, from a nearby building, I heard the familiar sounds of Spandau Ballet’s “True.” To be honest, I never liked “True” much. The Phantom Pain takes place in 1984, and as a kid in the suburbs of Chicago in that year who sometimes saw the video on MTV, the song felt too airy and ethereal to move me. But recontextualized in The Phantom Pain, I heard it differently. That precise ethereal quality made it such an effective contrast to the grim military seriousness and the tactile terrain that my heart began to ache. 
The presence of 80s pop songs in the isolated military outposts of the game is politically fascinating to me. It says something about how American and British cultural exports are absorbed by the entire world, but it’s largely a one-way street. A Pakistani friend of mine in high school had grown up with Sting, Bruce Springsteen, Elvis, but I’d never heard Pakistani music in my life. I don’t understand why so many players are so intent on not considering all the political dimensions of a game like this. They only make the experience infinitely more fascinating, even if and when they reveal the game’s failures.
The songs also allow for the creation of some great moments. I snuck into the building where the song was playing just so I could snag the tape, and the next time I was in the helicopter, I played it, and as the opening notes of “True” played, I panned the camera slowly around Big Boss, creating a very short music video that I honestly found exciting.
[Image: Big Boss in the helicopter as “True” plays]
I tweeted the clip, jokingly commenting that I’d “won Metal Gear Solid V by creating this beautiful moment,” but it had really felt this way to me. Creating this moment had been as fun and rewarding to me as anything else the game offered. Playing MGSV isn’t just sneaking and shooting, or at least for me it isn’t. This, too, is play.  So obviously, I get frustrated with the “Git Gud” players, those who feel that games are at their best when they’re perfectly calibrated tests of raw skill, that the only thing that matters is having an awesome KDR, or earning the highest possible rating on missions, or whatever. 
But the truth is that it’s not just hardcore gamers who set limits on our notions of play by talking about games like this. A lot of us do it, even a lot of us who consider ourselves emphatically opposed to the “Git Gud” brigade. We do it when we look at a game like Fortnite and see it only as one simple thing, a struggle to be the last remaining survivor, without at least acknowledging all the other things a player might go to the game for. We do it when we deny the possibility for moments of strange beauty to emerge from even a grim, ugly, grossly misogynistic game like MGSV. We do it whenever we, ourselves, adopt a limited, conventional understanding of what it means to really play a game, rather than fully engaging with all the different ways that we can find ourselves and each other in the spaces that games create.
-----
I’m currently looking for work. If you enjoy my writing and are in a position to do so, please consider supporting me on ko-fi.
13 notes · View notes
toddbirchard-architect · 5 years ago
Photo
Data Engineering, Big Data, and Other Vague Vocabulary https://ift.tt/2Kh85xj
I've spent the majority of my life dreading an eternal question that governs our lives. You know the one. It's the one that comes after our ritualistic handshakes and "nice to meet you"s. The one that summarizes our place in society, in 5 words or less: "what do you do?"
Most managers never seem to have this problem. My previous peers in product or engineering management roles had little trouble letting others fill in the blanks for them, but I've never been one comfortable with accepting hyperbolic inferences. For non-producing members of skilled teams, I doubt the integrity of one who nods in response to "oh, so you're the boss?" I instead relived groundhog day eternally, watching the progression where an acquaintance's eagerness to care deteriorates into realizing they don't.
A lifetime later, I landed my first title as a data engineer, and boy did that feel great! After years of enduring the cocktail-party-existential-crises, I had a real title. Fine, "manager" is a title, but this title had tangible substance! The first chance I had to introduce myself as a Data Engineer happened to be in Ibiza, of all places. As it turns out, an American stranded in Spain making friends with somebody from Bosnia has its language barriers, so the phrase "data engineer" wasn't quite translating well. The best stand-in explanation I could find was "hacker."
Data Engineers Are Definitely Not Hackers
I had a lot of assumptions about what it meant to be a "data engineer" going into it, and none of them were particularly outrageous. I'd had my hand in software development for over ten years at the time. The boom of mainstream data science bit me like a bug, as it did the rest of us, and something about the problems we could solve seemed to make software fun again. We weren't building worthless landing pages, or tired login screens. Instead, we could write sports betting algorithms, or mine the world's unprotected data. I already loved engineering like I loved Oreos, and this particular flavor of engineering felt like taking two Oreos apart and sticking them back together: less of the lame stuff, more of the awesome stuff.
Data Engineering isn't really Software Engineering
Obviously you need to be a software engineer in some capacity to be a data engineer. That said, the concerns of data engineers fall further from the tree than I initially anticipated.
Most programming work I engaged in before data revolved heavily around algorithms, whether I realized it at the time or not. Building consumer and business-facing products entails more moving parts than any single human can account for. Software worth using is an effort between many people accountable for many services, which make up some abstract entity used by vast quantities of unreasonable people (I kid). The challenge of engineering something complex comes in the clever decisions we make to leverage simplicity. The first time I ever dissected a Walkman, or took the lid off a toilet, or took apart a mechanical pen, the reaction was always the same: "that's it?" And yet, "that's quite genius."
A Day In The Life
The skills and duties of data engineering teams have zero consistency between companies. Some shops integrate data engineers with data scientists and analysts to supplement those teams. Other companies have massively siloed "big data" teams, which are almost always made up of Java developers who have seemingly found a comfortable (and lucrative) niche, forever married to MapReduce without the burdens of cross-department communication. Unfortunately, this scenario is far more common.
Most of a data engineer’s responsibilities revolve around ETL: the art of moving data from over there to over here. Or, perhaps also there. And yet, likely here, there, and there (and oh yeah, nothing is allowed to break, even though those things are different). The concept feels straightforward. It is. We're also dealing with incomprehensibly massive amounts of data, so it's also repetitively stressful. Straightforward and stressful aren't the sexiest adjectives to live by.
Tools Over Talent
Luckily for us, our company isn't the first company to work with data - that’s where our predetermined catalog of “big data” tools comes in. No matter how different data teams are between companies, the inescapable common ground is that data engineering is largely about tools. We’re talking Apache Hadoop, Apache Spark, Apache Kafka, Apache Airflow, Apache 2: Electric Boogaloo, and so forth.
Working with each of these things is a proprietary skill of its own. PySpark is essentially its own language masquerading as Python. Hadoop's complexity serves little purpose other than to ensure old school data engineers have jobs. Each of these tools is a behemoth, created to do a very specific thing in a very specific way. Becoming adept at Spark doesn’t make you a better engineer, or a problem solver: it just makes you good at using Spark. Airflow is a powerful tool for organizing and building data pipelines. With all its included bells and whistles, Airflow offers teams power and structure at no cost. It’s obvious that Airflow (and its equivalents) are “the right tool” upon using it, but structure comes at a price to human beings. It’s only a matter of time before I’m aware I’m mindlessly executing things in the only possible fashion they might be executed. Unlike building complex systems, it feels like data engineering only has so much room for clever optimization.
This doesn’t seem so bad to a 9-5 worker looking to live their non-office lives: hoarding lucrative knowledge is an easy way to pay the bills. What bothers me is that this mindset can only prevail if the person harnessing it does not actually enjoy programming. In every software engineering interview I've ever had, there's inevitably been some sort of hour-long algorithm whiteboard session where you optimize your brute force O(n^2) algorithm to O(n). While those are stressful, people who enjoy programming usually walk out of those interviews feeling like they enjoyed it. I've never been asked an algorithm question in a data engineering interview. Those go more like this:
Have you ever had a situation where you had to configure a Kafka instance using the 76C-X configuration variable on the 27th of May during a full moon?
I see you've worked with SQS, Kinesis, Kafka, Pub/Sub, and RabbitMQ, but have you ever worked with [obscure equivalent service this company uses, with the implication that it isn't exactly the same]?
I know you're not too hot on Hadoop, but can you tell me about the inner workings of this specific feature before it was deprecated 3 years ago anyway?
I'm running a PC with 4 cores and 16 gigs of ram, looking to parse a 200,000-line JSON file while vacationing with my family in Aruba. Which Python library would you use to engage Python's secret Hyperthreaded Voltron I/O Super Saiyan skill, and what kind of load would my machine be under as a result?
I'm barely kidding about these... even the last one. If Silicon Valley's primary hiring philosophy prioritizes smart people who can learn, data engineering interviews measure whether your wealth of useless trivia is culturally acceptable by people who value that sort of thing.
We Need To Address "Big Data"
I've been making some vast generalizations so far. I don't truly believe all data engineers share the same personality traits. In fact, there are at least two kinds of data people in this world: people who say "big data" unironically, and those who don't. I'm the latter. The complaints I have about our profession are directed at the former.
There's a big difference between a startup looking to "revolutionize the world with AI," and startups looking to leverage machine learning to optimize a case where it makes sense. Given the cheapness and implied misunderstanding of the term, simply hearing the phrase "AI" in a conversation has me questioning credibility. Don't get me started on Blockchain.
Big data has no actual definition other than "a lot of data." Trying to track down the origins of the phrase results in endless pages of data companies spewing absurd jargon (and hilariously copy+pasted definitions from one another), proudly aligning themselves with the new world order of Big Data. One article titled "A Brief History of Big Data" starts at the year 18,000 BCE. Get over yourselves.
In reality, the phrase "Big Data" started to pick up pace around 2012:
[Google Trends chart: search interest in "big data", 2004-2019]
We have Doug Laney to blame for coining the phrase in 2001, but if I had to guess, the trend seems much more closely correlated with the rise of Hadoop.
Hadoop enabled companies to work with and process much larger data than before, thus "Big Data" was technically relatively accurate. Java was by far the most common programming language being learned by new developers, being the de facto choice for school curriculum and general programming. I imagine it was an easy choice for many to double down on the language they knew by leveraging their knowledge and being Hadoop subject-matter experts. That's twice the job security and twice the skills!
Most people I know who overly emphasize their "big data" expertise are in fact Java/Hadoop guys. They're quick to ask how many petabytes or exabytes of data your last pipeline ran, fiercely keeping the gate open for only the Biggest of Data. They don't want to learn new programming languages. They don't want to see which data warehouse best fits their needs by reading the whitepapers. They don't want to question if it's really necessary for a cluster of hundreds of nodes to run small nightly jobs. They want to cling to a time where they made two good consecutive life decisions and partied to the Big Data anthem.
Bigger Doesn't Mean Better
Some data engineers are exactly what their titles imply: engineers with a specialty in data. On the other side of this, there's a rampant culture of gatekeeping and self-preservation which is almost certainly destroying company budgets in ways which aren't visible.
Data engineering teams with headcounts in the double-digits clock 8 hours a day, over-implementing systems too obsolete to turn profits for Cloudera, Hortonworks, or MapR. If these teams consisted of software engineers as opposed to big data engineers, we would have teams focused on creating the best solutions over the easiest ones.
July 31, 2019 at 12:24AM
0 notes
freshcodeit-blog · 6 years ago
Text
Introduction to message brokers. Part 1: Apache Kafka vs RabbitMQ
The growing amount of equipment connected to the Net has led to a new term: the Internet of Things (IoT). It grew out of machine-to-machine communication and refers to a set of devices that are able to interact with each other. The need for better system integration drove the development of message brokers, which are especially important for data analytics and business intelligence. In this article, we will look at two big data tools: Apache Kafka and RabbitMQ.
Why did message brokers appear?
Can you imagine the current amount of data in the world? Nowadays, about 12 billion “smart” machines are connected to the Internet. With about 7 billion people on the planet, that is almost one and a half devices per person. By 2020, their number will significantly increase to 200 billion, or even more. With technological development and the building of “smart” houses and other automated systems, our everyday lives become more and more digitized.
Message broker use case
As a result of this digitization, software developers face the problem of successful data exchange. Imagine you have your own application - for example, an online store. You work entirely within your own technology stack, and one day you need to make the application interact with another one. In the past, you would wire up simple point-to-point machine-to-machine integrations. But nowadays we have special message brokers. They make the process of data exchange simple and reliable. These tools use different protocols that determine the message format. The protocols define how the message should be transmitted, processed, and consumed.
Messaging in a nutshell
Wikipedia asserts that a message broker “translates a message from the formal messaging protocol of the sender to the formal messaging protocol of the receiver”.
Programs like this are essential parts of computer networks. They ensure transmitting of information from point A to point B.
When is a message broker needed?
When you want to control data feeds - for example, the number of registrations in a system.
When the task is to send data to several applications and avoid direct usage of their APIs.
When processes must be completed in a defined order, as in a transactional system.
So, we can say that message brokers can do 4 important things:
decouple the publisher and consumer
store the messages
route messages
check and organize messages
There are self-deployed and cloud-based messaging tools. In this article, I will share my experience of working with the first type.
Message broker Apache Kafka
Pricing: free
Official website: https://kafka.apache.org/
Useful resources: documentation, books
Pros:
Multi-tenancy
Easy to pick up
Powerful event streaming platform
Fault-tolerance and reliable solution
Good scalability
Free community distributed product
Suitable for real-time processing
Excellent for big data projects
Cons:
Lack of ready to use elements
The absence of complete monitoring set
Dependency on Apache Zookeeper
No routing
Issues with an increasing number of messages
What do Netflix, eBay, Uber, The New York Times, PayPal and Pinterest have in common?  All these great enterprises have used or are using the world’s most popular message broker, Apache Kafka.
THE STORY OF KAFKA DEVELOPMENT
With numerous advantages for real-time processing and big data projects, this asynchronous messaging technology has conquered the world. How did it start?
In 2010, LinkedIn engineers faced the problem of integrating huge amounts of data from their infrastructure into a lambda architecture, which also included Hadoop and real-time event processing systems.
Traditional message brokers didn't satisfy LinkedIn's needs. Those solutions were too heavy and slow. So, the engineering team developed a scalable and fault-tolerant messaging system without lots of bells and whistles. The new queue manager quickly transformed into a full-fledged event streaming platform.
APACHE KAFKA CAPABILITIES
The technology has become popular largely due to its compatibility. Let’s see. We can use Apache Kafka with a wide range of systems. They are:
web and desktop custom applications
microservices, monitoring and analytical systems
any needed sinks or sources
NoSQL, Oracle, Hadoop, SFDC
With the help of Apache Kafka, you can successfully create data-driven applications and manage complicated back-end systems. This queue manager has 3 main capabilities.
Apache Kafka is able to:
publish and subscribe to streams of records with excellent scalability and performance, which makes it suitable for company-wide use.
durably store the streams, distributing data across multiple nodes for a highly available deployment.
process data streams as they arrive, allowing you to aggregate, define windowing parameters, perform joins of data within a stream, etc.
APACHE KAFKA KEY TERMS AND CONCEPTS
First of all, you should know about the abstraction of a distributed commit log. This confusing term is crucial for the message broker. Many web developers used to think about "logs" in the context of a login feature. But Apache Kafka is based on the log data structure. This means a log is a time-ordered, append-only sequence of data inserts. As for other concepts, they are:
topics (the stored streams of records)
records (they include a key, a value, and a timestamp)
APIs (Producer API, Consumer API,  Streams API, Connector API)
The interaction between clients and servers is implemented with a simple and effective TCP protocol. It's a language-agnostic standard, so the client can be written in any language you want.
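To make these concepts concrete, here is a minimal sketch of publishing records to a topic with the official Java client (the broker address, topic name, and payload are illustrative assumptions, not part of the article):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of one or more brokers in the cluster (illustrative value)
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A record carries a key, a value, and (implicitly) a timestamp,
            // and is appended to the "orders" topic
            ProducerRecord<String, String> record =
                new ProducerRecord<>("orders", "order-42", "{\"item\":\"book\",\"qty\":1}");
            producer.send(record, (metadata, exception) -> {
                if (exception == null) {
                    System.out.printf("Stored at partition %d, offset %d%n",
                        metadata.partition(), metadata.offset());
                }
            });
        }
    }
}
```

Each send is appended to the topic's log, and the callback reports the partition and offset where the record landed.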
KAFKA WORKING PRINCIPLE
There are 2 main patterns of messaging:
queuing
publish-subscribe
Both of them have some pros and cons. The advantage of the first pattern is the opportunity to easily scale the processing. On the other hand, queues aren't multi-subscriber. The second model provides the possibility to broadcast data to multiple consumer groups. At the same time, scaling is more difficult in this case.
Apache Kafka magically combines these 2 ways of data processing, getting benefits of both of them. It should be mentioned that this queue manager provides better ordering guarantees than a traditional message broker.
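A rough sketch of how that combination looks with the Java consumer client: consumers sharing a group.id split the partitions between them (queuing), while an application using a different group.id receives its own full copy of the stream (publish-subscribe). The group and topic names below are assumptions for illustration:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers with the same group.id share the partitions of the topic;
        // a second application with a different group.id gets every record again
        props.put("group.id", "billing-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {  // poll loop runs until the process is stopped
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```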
KAFKA PECULIARITIES
Combining the functions of messaging, storage, and processing, Kafka isn’t a common message broker. It’s a powerful event streaming platform capable of handling trillions of messages a day. Kafka is useful both for storing and processing historical data from the past and for real-time work. You can use it for creating streaming applications, as well as for streaming data pipelines.
If you want to follow the steps of Kafka users, you should be mindful of some nuances:
the messages don’t have separate IDs (they are addressed by their offset in the log)
the system doesn’t check the consumers of each topic or message
Kafka doesn’t maintain any indexes and doesn’t allow random access (it just delivers the messages in order, starting with the offset)
the system doesn’t have deletes and doesn’t buffer the messages in userspace (but there are various configurable storage strategies)
CONCLUSION
Although it is an excellent open-source solution for real-time statistics and big data projects, this message broker has some weaknesses. The main one is that it requires a lot of work on your part. You will feel the lack of plugins and other components that could simply be reused in your code.
I recommend using this publish/subscribe and queueing tool when you need to process really big amounts of data (100,000 messages per second and more). In this case, Apache Kafka will satisfy your needs.
Message broker RabbitMQ
Pricing: free
Official website: https://www.rabbitmq.com
Useful resources: tools, best practices
Pros:
Suitable for many programming languages and messaging protocols
Can be used on different operating systems and cloud environments
Simple to start using and to deploy
Gives an opportunity to use various developer tools
Modern in-built user interface
Offers clustering and is very good at it
Scales to around 500,000+ messages per second
Cons:
Non-transactional (by default)
Needs Erlang
Minimal configuration that can be done through code
Issues with processing big amounts of data
The next very popular solution is written in Erlang. As Erlang is a simple, general-purpose, functional programming language with many ready-to-use components, this software doesn't require lots of manual work. RabbitMQ is known as a “traditional” message broker, which is suitable for a wide range of projects. It is successfully used by new startups and notable enterprises alike.
The software is built on the Open Telecom Platform framework for clustering and failover. You can find many client libraries for using the queue manager, written on all major programming languages.
THE STORY OF RABBITMQ DEVELOPMENT
One of the oldest open source message brokers can be used with various protocols. Many web developers like this software, because of its useful features, libraries, development tools, and instructions.
In 2007, Rabbit Technologies Ltd. developed the system, which originally implemented AMQP, an open wire protocol for messaging with complex routing features. AMQP provided the cross-language flexibility to use message brokering solutions outside the Java ecosystem. In fact, RabbitMQ works perfectly with Java, Spring, .NET, PHP, Python, Ruby, JavaScript, Go, Elixir, Objective-C, Swift and many other technologies. The numerous plugins and libraries are the main advantage of the software.
RABBITMQ CAPABILITIES
Created as a message broker for general usage, RabbitMQ is based on the pub-sub communication pattern. The messaging process can be either synchronous or asynchronous, as you prefer. So, the main features of the message broker are:
Support of numerous protocols and message queuing, changeable routing to queues, different types of exchange.
Clustered deployment ensures high availability and throughput. The software can be used across various zones and regions.
The possibilities to use Puppet, BOSH, Chef and Docker for deployment. Compatibility with the most popular modern programming languages.
The opportunity of simple deployment in both private and public clouds.  
Pluggable authentication and authorization, with support for TLS and LDAP.
Many of the proposed tools can be used for continuous integration, operational metrics, and work with other enterprise systems.
RABBITMQ WORKING PRINCIPLE
Being a broker-centric program, RabbitMQ gives guarantees between producers and consumers. If you choose this software, you should use transient messages, rather than durable.
The program uses the broker to check the state of a message and verify whether the delivery was successfully completed. The message broker presumes that consumers are usually online.
As for the message ordering, the consumers will get the message in the published order itself. The order of publishing is managed consistently.
RABBITMQ PECULIARITIES
The main advantage of this message broker is the perfect set of plugins, combined with nice scalability. Many web developers enjoy clear documentation and well-defined rules, as well as the possibility of working with various message exchange models. In fact, RabbitMQ is suitable for 3 of them:
Direct exchange model (each message is routed to individual queues, one by one, by its routing key)
Topic exchange model (each consumer gets a message which is sent to a specific topic)
Fanout exchange model (all consumers connected to queues get the message).
Here you can see the gap between Kafka and RabbitMQ. If a consumer isn't connected to a fanout exchange in RabbitMQ when a message is published, the message is lost to it. Kafka avoids this, because any consumer can read any message retained in the log.
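Below is a minimal sketch of the fanout model with the RabbitMQ Java client; the exchange and queue names are illustrative assumptions. Note how the queue must be declared and bound before publishing, which is exactly the gap described above:

```java
import java.nio.charset.StandardCharsets;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class FanoutExample {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");  // assumed local broker

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            // A fanout exchange copies every message to all queues bound to it
            channel.exchangeDeclare("orders.fanout", "fanout");

            // The consumer's queue must exist and be bound *before* publishing,
            // otherwise the broker has nowhere to deliver the message
            String queueName = channel.queueDeclare().getQueue();
            channel.queueBind(queueName, "orders.fanout", "");

            // Publish; the routing key is ignored by fanout exchanges
            channel.basicPublish("orders.fanout", "", null,
                "order created".getBytes(StandardCharsets.UTF_8));

            channel.basicConsume(queueName, true,
                (consumerTag, delivery) ->
                    System.out.println("Received: "
                        + new String(delivery.getBody(), StandardCharsets.UTF_8)),
                consumerTag -> { });

            Thread.sleep(1000);  // give the callback a moment before the connection closes
        }
    }
}
```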
CONCLUSION
As for me, I like RabbitMQ because of the many plugins available. They save time and speed up work. You can easily adjust filters, priorities, message ordering, etc. Just like Kafka, RabbitMQ requires you to deploy and manage the software, but it has a convenient built-in UI and allows using SSL for better security. When it comes to coping with big data loads, however, RabbitMQ is inferior to Kafka.
To sum up, both Apache Kafka and RabbitMQ are truly worth the attention of skillful software developers. I hope my article will help you find suitable big data technologies for your project. If you still have any questions, you are welcome to contact Freshcode specialists. In the next review we will compare other powerful messaging tools, ActiveMQ and Redis Pub/Sub.
The original article Introduction to message brokers. Part 1: Apache Kafka vs RabbitMQ was published at freshcodeit.com.
0 notes
insiderandroidtk-blog · 8 years ago
Text
RxJava support 2.0 releases support for Android apps
The RxJava team has released version 2.0 of their reactive Java framework, after an 18 month development cycle. RxJava is part of the ReactiveX family of libraries and frameworks, which is in their words, "a combination of the best ideas from the Observer pattern, the Iterator pattern, and functional programming". The project's "What's different in 2.0" is a good guide for developers already familiar with RxJava 1.x.
RxJava 2.0 is a brand new implementation of RxJava. This release is based on the Reactive Streams specification, an initiative for providing a standard for asynchronous stream processing with non-blocking back pressure, targeting runtime environments (JVM and JavaScript) as well as network protocols.
Reactive implementations have concepts of publishers and subscribers, as well as ways to subscribe to data streams, get the next stream of data, handle errors and close the connection.
The Reactive Streams spec will be included in JDK 9 as java.util.concurrent.Flow. The following interfaces correspond to the Reactive Streams spec. As you can see, the spec is small, consisting of just four interfaces:
Flow.Processor<T,R>: A component that acts as both a Subscriber and Publisher.
Flow.Publisher<T>: A producer of items (and related control messages) received by Subscribers.
Flow.Subscriber<T>: A receiver of messages.
Flow.Subscription: Message control linking a Flow.Publisher and Flow.Subscriber.
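As a rough illustration (not part of the original announcement), here is a tiny example using these JDK 9 interfaces together with the SubmissionPublisher helper class that ships alongside them; the printed values are just placeholders:

```java
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

public class FlowDemo {
    public static void main(String[] args) throws InterruptedException {
        try (SubmissionPublisher<String> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(new Flow.Subscriber<String>() {
                private Flow.Subscription subscription;

                @Override public void onSubscribe(Flow.Subscription subscription) {
                    this.subscription = subscription;
                    subscription.request(1);  // non-blocking back pressure: ask for one item
                }
                @Override public void onNext(String item) {
                    System.out.println("received: " + item);
                    subscription.request(1);  // ask for the next item only when ready
                }
                @Override public void onError(Throwable throwable) { throwable.printStackTrace(); }
                @Override public void onComplete() { System.out.println("done"); }
            });

            publisher.submit("hello");
            publisher.submit("reactive streams");
        }
        Thread.sleep(500);  // crude wait so the async callbacks can run before the JVM exits
    }
}
```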
Spring Framework 5 is also going reactive. To see how this looks, refer to Josh Long's Functional Reactive Endpoints with Spring Framework 5.0.
To learn more about RxJava 2.0, InfoQ interviewed main RxJava 2.0 contributor, David Karnok.
InfoQ: First of all, congrats on RxJava 2.0! 18 months in the making, that's quite a feat. What are you most proud of in this release?
David Karnok: Thanks! In some way, I wish it didn't take so long. There was a 10 month pause when Ben Christensen, the original author who started RxJava, left and there was no one at Netflix to push this forward. I'm sure many will contest that things got way better when I took over the lead role this June. I'm proud my research into more advanced and more performant Reactive Streams paid off and RxJava 2 is the proof all of it works.
InfoQ: What’s different in RxJava 2.0 and how does it help developers?
Karnok: There are a lot of differences between version 1 and 2 and it's impossible to list them all here, but you can visit the dedicated wiki page for a comprehensible explanation. In headlines, we now support the de-facto standard Reactive Streams specification, have significant overhead reduction in many places, no longer allow nulls, have split types into two groups based on support of or lack of backpressure and have explicit interop between the base reactive types.
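To give a flavor of the split types Karnok mentions, here is a small sketch of my own (not from the interview): Flowable is the backpressure-aware, Reactive Streams-compliant type, while Observable is the backpressure-free variant:

```java
import io.reactivex.Flowable;
import io.reactivex.Observable;

public class RxJava2Basics {
    public static void main(String[] args) {
        // Flowable: Reactive Streams compliant, backpressure-aware
        Flowable.range(1, 5)
                .map(i -> i * 10)
                .filter(i -> i > 20)
                .subscribe(i -> System.out.println("flowable: " + i));

        // Observable: no backpressure, intended for UI events and small/finite streams
        Observable.just("a", "b", "c")
                .subscribe(s -> System.out.println("observable: " + s));
    }
}
```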
InfoQ: Where do you see RxJava used the most (e.g. IoT, real-time data processing, etc.)?
Karnok: RxJava is more dominantly used by the Android community, based on the feedback I saw. I believe the server side is dominated by Project Reactor and Akka at the moment. I haven't specifically seen IoT mentioned or use RxJava (it requires Java), but maybe they use some other reactive library available on their platform. For real-time data processing people still tend to use other solutions, most of them not really reactive, and I'm not aware of any providers (maybe Pivotal) who are pushing for reactive in this area.
InfoQ: What benefits does RxJava provide Android more than other environments that would explain the increased traction?
Karnok: As far as I see, Android wanted to "help" their users solve async and concurrent problems with Android-only tools such as AsyncTask, Looper/Handler, etc.
Unfortunately, their design and use is quite inconvenient, often hard to understand or predict and generally brings frustration to Android developers. These can largely contribute to callback hell and the difficulty of keeping async operations off the main thread.
RxJava's design (inherited from the ReactiveX design of Microsoft) is dataflow-oriented and orthogonalized where actions execute from when data appears for processing. In addition, error handling and reporting is a key part of the flows. With AsyncTask, you had to manually work out the error delivery pattern and cancel pending tasks, whereas RxJava does that as part of its contract.
In practical terms, having a flow that queries several services in the background and then presents the results in the main thread can be expressed in a few lines with RxJava (+Retrofit) and a simple screen rotation will cancel the service calls promptly.
This is a huge productivity win for Android developers, and the simplicity helps them climb the steep learning curve the whole reactive programming's paradigm shift requires. Maybe at the end, RxJava is so attractive to Android because it reduces the "time-to-market" for individual developers, startups and small companies in the mobile app business.
There was nothing of a comparable issue on the desktop/server side Java, in my opinion, at that time. People learned to fire up ExecutorService's and wait on Future.get(), knew about SwingUtilities.invokeLater to send data back to the GUI thread and otherwise the Servlet API, which is one thread per request only (pre 3.0) naturally favored blocking APIs (database, service calls).
Desktop/server folks are more interested in the performance benefits a non-blocking design of their services offers (rather than how easy one can write a service). However, unlike Android development, having just RxJava is not enough and many expect/need complete frameworks to assist their business logic as there is no "proper" standard for non-blocking web services to replace the Servlet API. (Yes, there is Spring (~Boot) and Play but they feel a bit bandwagon-y to me at the moment).
InfoQ: HTTP is a synchronous protocol and can cause a lot of back pressure when using microservices. Streaming platforms like Akka and Apache Kafka help to solve this. Does RxJava 2.0 do anything to allow automatic back pressure?
Karnok: RxJava 2's Flowable type implements the Reactive Streams interface/specification and does support backpressure. However, the Java level backpressure is quite different from the network level backpressure. For us, backpressure means how many objects to deliver through the pipeline between different stages where these objects can be non uniform in type and size. On the network level one deals with usually fixed size packets, and backpressure manifests via the lack of acknowledgement of previously sent packets. In classical setup, the network backpressure manifests on the Java level as blocking calls that don't return until all pieces of data have been written. There are non-blocking setups, such as Netty, where the blocking is replaced by implicit buffering, and as far as I know there are only individual, non-uniform and non Reactive Streams compatible ways of handling those (i.e., a check for canWrite which one has to spin over/retry periodically). There exist libraries that try to bridge the two worlds (RxNetty, some Spring) with varying degrees of success as I see it.
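As a hedged illustration of what Java-level backpressure looks like in RxJava 2 terms (my own sketch, not Karnok's), a subscriber can bound its demand so the upstream only emits what has actually been requested; the batch size and output are arbitrary:

```java
import io.reactivex.Flowable;
import io.reactivex.subscribers.DefaultSubscriber;

public class BackpressureDemo {
    public static void main(String[] args) {
        Flowable.range(1, 1_000)
            .subscribe(new DefaultSubscriber<Integer>() {
                @Override protected void onStart() {
                    request(2);   // initial demand: only two items
                }
                @Override public void onNext(Integer value) {
                    System.out.println("processing " + value);
                    request(1);   // pull one more only after finishing this one
                }
                @Override public void onError(Throwable t) { t.printStackTrace(); }
                @Override public void onComplete() { System.out.println("all items processed"); }
            });
    }
}
```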
InfoQ: Do you think reactive frameworks are necessary to handle large amounts of traffic and real-time data?
Karnok: It depends on the problem complexity. For example, if your task is to count the number of characters in a big-data source, there are faster and more dedicated ways of doing that. If your task is to compose results from services on top of a stream of incoming data in order to return something detailed, reactive-based solutions are quite adequate. As far as I know, most libraries and frameworks around Reactive Streams were designed for throughput and not really for latency. For example, in high-frequency trading, the predictable latency is very important and can be easily met by Aeron but not the main focus for RxJava due to the unpredictable latency behavior.
InfoQ: Does HTTP/2 help solve the scalability issues that HTTP/1.1 has?
Karnok: This is not related to RxJava and I personally haven't played with HTTP/2 but only read the spec. Multiplexing over the same channel is certainly a win in addition to the support for explicit backpressure (i.e., even if the network can deliver, the client may still be unable to process the data in that volume) per stream. I don't know all the technical details but I believe Spring Reactive Web does support HTTP/2 transport if available but they hide all the complexity behind reactive abstractions so you can express your processing pipeline in RxJava 2 and Reactor 3 terms if you wish.
InfoQ: Java 9 is projected to be featuring some reactive functionality. Is that a complete spec?
Karnok: No. Java 9 will only feature 4 Java interfaces with 7 methods total. No stream-like or Rx-like API on top of that nor any JDK features built on that.
InfoQ: Will that obviate the need for RxJava if it is built right into the JDK?
Karnok: No and I believe there's going to be more need for a true and proven library such as RxJava. Once the toolchains grow up to Java 9, we will certainly provide adapters and we may convert (or rewrite) RxJava 3 on top of Java 9's features (VarHandles).
One of my fears is that once Java 9 is out, many will try to write their own libraries (individuals, companies) and the "market" gets diluted with big branded-low quality solutions, not to mention the increased amount of "how does RxJava differ from X" questions.
My personal opinion is that this is already happening today around Reactive Streams where certain libraries and frameworks advertise themselves as RS but fail to deliver based on it. My (likely biased) conjecture is that RxJava 2 is the closest library/technology to an optimal Reactive Streams-based solution that can be.
InfoQ: What's next for RxJava?
Karnok: We had fantastic reviewers, such as Jake Wharton, during the development of RxJava 2. Unfortunately, the developer previews and release candidates didn't generate enough attention and despite our efforts, small problems and oversights slipped into the final release. I don't expect major issues in the coming months but we will keep fixing both version 1 and 2 as well as occasionally adding new operators to support our user base. A few companion libraries, such as RxAndroid, now provide RxJava 2 compatible versions, but the majority of the other libraries don't yet or haven't yet decided how to go forward. In terms of RxJava, I plan to retire RxJava 1 within six months (i.e., only bugfixes then on), partly due to the increasing maintenance burden on my "one man army" for some time now; partly to "encourage" the others to switch to the RxJava 2 ecosystem. As for RxJava 3, I don't have any concrete plans yet. There are ongoing discussions about splitting the library along types or along backpressure support as well as making the so-called operator-fusion elements (which give a significant boost to our performance) a standard extension of the Reactive Streams specification.
1 note · View note
netmetic · 4 years ago
Text
Why Architects Need Tools for Apache Kafka Service Discovery, Auditing, and Topology Visualization
You’re out of control. I hate to be the bearer of bad news, but sometimes we need to hear the truth. You know Apache Kafka, you love Apache Kafka, but as your projects and architecture have evolved, it has left you in an uncomfortable situation. Despite its real-time streaming benefits, the lack of tooling for Kafka service discovery, a reliable audit tool, or a topology visualizer has led you to a place I call “Kafka Hell”. Let me explain how you got here in 4 simple, detrimental, and unfortunately unavoidable steps.
You learned of the benefits of EDA and/or Apache Kafka. Whether you came at it from a pure technology perspective, or because your users/customers demanded real-time access to data/insights, you recognized the benefits of being more real-time.
You had some small project implementation and success. You identified a use case you thought events would work well for, figured out the sources of information, and the new modern applications to go with it. Happy days!
You reused your existing event streams. Within your team, you made use of the one-to-many distribution pattern (publish/subscribe) and built more applications reusing existing streams. Sweetness!
You lost control. Then other developers started building event-driven applications and mayhem ensued. You had so many topics, partitions, consumer groups, connectors – all good things, but then the questions started: What streams are available? Which group or application should be able to consume which streams? Who owns each stream? How do you visualize this complexity? It’s a mess, am I right?
History Repeats Itself
As we moved away from SOAP-based web services and REST became the predominant methodology for application interactions, there was a moment when many organizations faced the same challenges we face today with EDA and Apache Kafka.
Back then, SOA's maturity brought about tooling that supported authoring, managing, and governing your SOAP/WSDL-based APIs. The tooling was generally categorized as "Service Registry and Repository." The user experience sucked, but I bet you know that already!
Enter REST. Organizations which were/are technical pioneers quickly adopted the RESTful methodology; but since the tooling ecosystem was immature, they faced challenges as they moved from a handful of RESTful services to a multitude of them.
Sound like what we face with Kafka today?
The answer to the original problem was the emergence of the “API management” ecosystem. Led by Mulesoft, Apigee, and Axway, API management tools provided the following key capabilities:
Runtime Gateway: A server that acts as an API front end. It receives API requests, enforces throttling and security policies, passes requests to the back-end service, and then passes the response back to the requester. The gateway can provide functionality to support authentication, authorization, security, audit, and regulatory compliance.
API Authoring and Publishing tools: A collection of tools that API providers use to document and define APIs (for instance, using the OpenAPI or RAML specifications); generate API documentation, govern API usage through access and usage policies for APIs; test and debug the execution of APIs, including security testing and automated generation of tests and test suites; deploy APIs into production, staging, and quality assurance environments; and coordinate the overall API lifecycle.
External/Developer Portal: A community site, typically branded by an API provider. It encapsulates information and functionality in a single convenient source for API users. This includes: documentation, tutorials, sample code, software development kits, and interactive API consoles and sandboxes for trials. A portal lets users register for APIs, manage subscription keys (such as OAuth2, Client ID, and Client Secret), and obtain support from the API provider and user community. In addition, it provides the linkage into productivity tooling that enables developers to easily generate consistent clients and service implementations.
Reporting and Analytics: Performing analysis of API usage and load, such as: overall hits, completed transactions, number of data objects returned, amount of compute time, other resources consumed, volume of data transferred, etc. The information gathered by the reporting and analytics functionality can be used by the API provider to optimize the API offering within an organization’s overall continuous improvement process and for defining software service-level agreements for APIs.
Without these functions, we would have had chaos. I truly believe the momentum behind RESTful APIs would have died a slow, agonizing death without a way to manage and govern the overwhelming quantity of APIs. This reality would have led to constantly breaking API clients, security leaks, loss of sensitive information, and interested parties generally flying blind with respect to existing services. It would have been a dark and gloomy time.
We Need to Manage and Govern Event Streams the Way We Do APIs
I bet if we all had a dollar for every time our parents said, “You need to grow up,” when we were younger, we would all be millionaires. But that is exactly what we need to do as it relates to event streams, whether you are using Apache Kafka, Confluent, MSK, or any other streaming technology. If we take our queues (no pun intended) from the success of API management – and the API-led movement in general – we have a long way to go in the asynchronous, event streaming space.
Over the last few years, I have poured a lot of my professional energy into working with organizations who have deployed Apache Kafka into production, and who I would consider to be technical leaders within their space. What I have heard time and time again is that the use of Apache Kafka has spread like wildfire to the point where they no longer know what they have, and the stream consumption patterns are nearly 1 to 1. This means that while data is being processed in real time (which is great), they are not getting a good return on their investment. A stream only being consumed once is literally a 1 to 1 exchange, but the real value of EDA lies in being able to easily reuse existing real-time data assets, and that can only be done if they are managed and governed appropriately.
Another common complaint about Apache Kafka is the inability to understand and visualize the way in which event streams are flowing. Choreographing the business processes and functions with Apache Kafka has become difficult without a topology visualizer. One architect described it as the “fog of war” – events are being fired everywhere, but nobody knows where they are going or what they are doing.
Events in large enterprises rarely originate from a Kafka-native application; they usually come from a variety of legacy applications (systems of record, old JEE apps, etc.) or from new, modern, IoT sensors and web apps. Thus, we need end-to-end visibility in order to properly understand the event-driven enterprise.
We need to adopt the methodology described by the key capabilities of an API management platform, but for the Kafka event streaming paradigm. We already have the equivalent of the API gateway (your Kafka broker), but we are sorely lacking the stream authoring and publishing tools, external/developer portals, and reporting and analytics capabilities found in API management solutions today. Ironically, I would claim that the decoupled EDA/Kafka ecosystem of a large organization is more complex and harder to manage than its synchronous APIs, which is why we need an "event management" capability now more than ever!
Technical Debt and the Need for Kafka Service Discovery
I hope by now you've bought into the idea that you need to govern and manage your Kafka event streams like you do your RESTful APIs. Your next question is most likely, "Sounds great Jonathan, but I don't know what I even have, and I surely don't want to have to figure it out myself!" And to that, I say, "Preach!" I have walked in your shoes and recognize that technical documentation always gets out of date and is too often forgotten as an application continues to evolve. This is the technical debt problem that can spiral out of control as your use of EDA and Kafka grows over time.
So, that is exactly why you need to automate Kafka service discovery: introspect which topics, partitions, consumer groups, and connectors are configured so you can begin down the road of managing them like you do your other APIs. Without the ability to determine the reality (what's going on at runtime is reality, whether you like it or not), you can document what you think you have, but it will never be the source of truth you can depend on.
A reliable Kafka service discovery tool with the requirements I listed above will be that source of truth you need.
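As a rough sketch of what that runtime introspection can look like with the stock Kafka Admin API (the broker address is a placeholder, and connector discovery would go through the Kafka Connect REST API rather than the AdminClient; this is an illustration, not the tool described in this post):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupListing;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class KafkaInventory {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Discover topics and their partition counts.
            Set<String> topicNames = admin.listTopics().names().get();
            Map<String, TopicDescription> topics = admin.describeTopics(topicNames).all().get();
            topics.forEach((name, desc) ->
                    System.out.println(name + ": " + desc.partitions().size() + " partitions"));

            // Discover consumer groups.
            for (ConsumerGroupListing group : admin.listConsumerGroups().all().get()) {
                System.out.println("consumer group: " + group.groupId());
            }
        }
    }
}
```

Running something like this on a schedule, and diffing its output against the previous run, is the essence of the audit loop described next.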
Once you have discovered what you have with a Kafka service discovery tool, you’ll need to find a way to keep it consistent as things inevitably change. There needs to be continuous probing to ensure that as the applications and architecture change, the documentation is kept up to date and continues to reflect the runtime reality. This means that on a periodic basis, the Kafka service discovery tool needs to be run in order to audit and find changes, enabling you to decide if the change was intended or not. This will ensure the Kafka event streams documentation (which applications are producing and consuming each event stream) and the schemas are always consistent.
Thus, the path to solving the technical debt dilemma and design consistency problem with Apache Kafka is a Kafka service discovery tool.
The Future of Kafka Service Discovery
I hope I’ve given you a little insight into why you are struggling to manage and understand your Kafka streams and what kind of tools the industry will need to solve these particular pain points. Recognizing the problem is the first step in solving it!
Solace has been taking a proactive role in developing the capabilities I outlined above, specifically for Kafka users: authoring, developer portal, metrics, service discovery, audit tool, etc. I encourage you to stay tuned and let us know if you agree that this type of capability is sorely needed! I am confident that soon you will be able to manage and govern your Apache Kafka event streams like you do your APIs. And won't that be exciting!
The post Why Architects Need Tools for Apache Kafka Service Discovery, Auditing, and Topology Visualization appeared first on Solace.
0 notes
csemntwinl3x0a1 · 7 years ago
Text
Apache Kafka and the four challenges of production machine learning systems
Untangling data pipelines with a streaming platform.
Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their product, processes or customer experience. The cartoon version of machine learning sounds quite easy: you feed in training data made up of examples of good and bad outcomes, and the computer automatically learns from these and spits out a model that can make similar predictions on new data not seen before. What could be easier, right?
Those with real experience building and deploying production systems built around machine learning know that, in fact, these systems are shockingly hard to build. This difficulty is not, for the most part, the algorithmic or mathematical complexities of machine learning algorithms. Creating such algorithms is difficult, to be sure, but the algorithm creation process is mostly done by academic researchers. Teams that use the algorithms in production systems almost always use off-the-shelf libraries of pre-built algorithms. Nor is the difficulty primarily in using the algorithms to generate a model, though learning to debug and improve machine learning models is a skill in its own right. Rather, the core difficulty is that building and deploying systems that use machine learning is very different from traditional software engineering, and as a result, it requires practices and architectures different from what teams with a traditional software engineering background are familiar with.
Most software engineers have a clear understanding of how to build production-ready software using traditional tools and practices. In essence, traditional software engineering is the practice of turning business logic explicitly specified by engineers or product managers into production software systems. We have developed many practices in how to test and validate this type of system. We know the portfolio of skills a team needs to design, implement, and operate them, as well as the methodologies that can work to evolve them over time. Building traditional software isn’t easy, but it is well understood.
Machine learning systems have some of the same requirements as traditional software systems. We want fast, reliable systems we can evolve over time. But a lot about the architecture, skills, and practices of building them is quite different.
My own experience with this came earlier in my career. I'd studied machine learning in school and done work on applying support vector machines, graphical models, and Bayesian statistical methods on geographical data sets. This was a good background in the theory of machine learning and statistics, but was surprisingly unhelpful about how one would build this kind of system in practice (probably because, at the time, very few people were). One set of experiences that drove this came from my time at LinkedIn, beginning in 2007. I had joined LinkedIn for the data: the opportunity to apply cutting-edge algorithms to social network data. I managed the team that, among other things, ran the People You May Know and Similar Profiles products, both of which we wanted to move from ad hoc heuristics to machine learning models. Over time, LinkedIn built many such systems for problems as diverse as newsfeed relevance, advertising, standardizing profile data, and fighting fraud and abuse.
In doing this, we learned a lot about what was required to successfully build production machine learning systems, which I’ll describe in this post. But first, it’s worth getting specific about what makes it hard. Here are four particularly difficult challenges.
Challenge one: Machine learning systems use advanced analytical techniques in production software
The nature of machine learning and its core difference from traditional analytics is that it allows the automation of decision-making. Traditional reporting and analytics is generally an input for a human who would ultimately make and carry out the resulting decision manually. Machine learning is aimed at automatically producing an optimal decision.
The good news is that by taking the human out of the loop, many more decisions can be made: instead of making one global decision (that is often all humans have time for), automated decision-making allows making decisions continuously and dynamically in a personalized way as part of every customer experience on things that are far too messy for traditional, manually specified business rules.
The challenge this presents, though, is that it generally demands directly integrating sophisticated prediction algorithms into production software systems. Human-driven analytics has plenty of failures, but the impact is more limited as there is a human in the loop to sanity-check the result (or at least take the blame if there is a problem!). Machine learning systems that go bad, however, often have the possibility of serious negative impact to the business before the problem is discovered and remediated.
Challenge two: Integrating model builders and system builders
How do you build a system that requires a mixture of advanced analytics as well as traditional software engineering? This is one of the biggest organizational challenges in building teams to create this type of system. It is quite hard to find engineers who have expertise in both traditional production software engineering and also machine learning or statistics. Yet, not only are both of these skills needed, they need to be integrated very closely in a production-quality software system.
Complicating this, the model building toolset is rapidly evolving and will often change as the model itself evolves. Early quick and dirty work may be little more than a simple R or Python script. If scalability is a challenge, Spark or other big data tools may be used. As the modeling graduates from simple regression models to more advanced techniques, algorithm-specific frameworks like Keras or TensorFlow may be employed.
Even more than the difference in tools and skills, the methodology and values are often different between model builders and production software engineers. Production software engineers rigorously design and test systems that have well-defined behaviors in all cases. They tend to focus on performance, maintainability, and correctness. Model builders have a different definition of excellence. They tend to value enough software engineering to get their job done and no more. Their job is inherently more experimental: rather than specifying the right behavior and then implementing it, model builders need to experimentally gather possible inputs and test their impact on predictive accuracy. This group naturally learns to avoid a heavy up-front engineering process for these experiments, as most will fail and need to be removed.
The speed with which the model building team can generate hypotheses about new data or algorithmic tweaks that might produce improvements in predictive accuracy, and validate or invalidate these hypotheses, is what determines the progress of the team at driving predictive accuracy.
A fundamental issue in the architecture of a machine learning-powered application is how these two groups can interact productively: the model builders need to be able to iterate in an agile fashion on predictive modeling and the production software engineers need to integrate these model changes in a stable, scalable production system. How can the model builders be kept safe and agile when their very model is at the heart of a system that is an integral part of an always-on production software system?
Challenge three: The failure of QA and the importance of instrumentation
Normal software has a well-defined notion of correctness that is mostly independent of the data it will run on. This lets new code be written, tested, and shipped from a development to production environment with reasonable confidence that it will be correct.
Machine learning systems are simply not well validated by traditional QA practices. A machine learning model takes input values and generally produces a number as output. Whether the system is working well or poorly, the output will still be a number. Worse, it is mostly impossible to specify correctness for a single input and the corresponding output, the core practice in traditional QA. The very notion of correctness for machine learning is statistical and defined with respect not just to the code that runs it, but to the input data on which it is run. Unit and integration tests, the backbone of modern approaches to quality assurance in traditional software engineering, are often quite useless in this domain.
The result is that a team building a machine learning application must lean very heavily on instrumentation of the running system to understand the behavior of their application in production, and they must have access to real data to build and test against, prior to production deployment of any model changes. What inputs did the application have? What prediction did it make? What did the ground truth turn out to be? Recording these facts and analyzing them both retrospectively and in real-time against the running system is the key to detecting production issues quickly and limiting their impact.
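As a minimal sketch of that kind of instrumentation, assuming the events are captured to Kafka topics (the topic names and JSON layout are invented for illustration, not taken from any system described here):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class PredictionLogger {
    private final KafkaProducer<String, String> producer;

    public PredictionLogger() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // Record the inputs the application had and the prediction it made.
    public void logPrediction(String requestId, String modelVersion, double score, String featuresJson) {
        String event = String.format(
            "{\"requestId\":\"%s\",\"model\":\"%s\",\"score\":%f,\"features\":%s}",
            requestId, modelVersion, score, featuresJson);
        producer.send(new ProducerRecord<>("prediction-events", requestId, event));
    }

    // Ground truth usually arrives later and is joined back by requestId.
    public void logOutcome(String requestId, boolean outcome) {
        producer.send(new ProducerRecord<>("prediction-outcomes", requestId,
            "{\"requestId\":\"" + requestId + "\",\"outcome\":" + outcome + "}"));
    }
}
```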
In many ways, this is an extension of a trend in software engineering that has become quite pronounced over the last 10 years. When I started my career, software was heavily QA'd and then shipped to production with very little ability to follow its behavior there. Modern software engineering still values testing (though hopefully with a bit more automation), but it also uses logging and metrics much more heavily to track the performance of systems in real time.
This discipline of "observability" is even more important in machine learning systems. If modern systems are piloted by a combination of "looking at the instruments" and "looking out the window," machine learning systems quickly must be flown almost entirely by the instruments. You cannot easily say if one bad result is an indication of something bad or just the expected failure rate of an inherently imperfect prediction algorithm.
The quality and detail of the instrumentation is particularly important because the output of the machine learning model often changes the behavior of the system being measured. A different model creates a different customer experience, and the only way to say if that is better or worse than it was before is to perform rigorous A/B tests that evaluate each change in the model and allow you to measure and analyze its performance across relevant segments.
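A small sketch of the deterministic bucketing such A/B tests rely on, so the same user always sees the same variant and the exposure can be logged next to the prediction (the split and variant names are made up; production systems typically salt the hash per experiment):

```java
public final class AbAssignment {
    private AbAssignment() {}

    // Hash the user into a stable bucket 0..99 and route 10% of traffic to the candidate model.
    public static String variantFor(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < 10 ? "candidate-model" : "baseline-model";
    }
}
```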
Even this may not be enough. At LinkedIn, we measured improvements in performance only over a short time. One surprise we found was that there was a novelty effect: even adding random noise to the prediction could improve it in the short term because it produced novel results, but this would decay over time. This led to a series of improvements: tracking and using the novelty of the results that had been shown, feeding the prior exposures into the model, and changing our procedures to run our experiments over a longer period of time.
Challenge four: Diverse data dependencies
Many of the fundamental difficulties in machine learning systems come from their hunger for data. Of course, all software is about data, but a traditional application needs only the data it uses to perform its business logic, or that which it intends to display on the screen. Machine learning models have a much more voracious hunger for data. The more data, the better! This can mean more examples to train your model with, but often even more useful is more features to include in the model. This necessitates a process of gathering diverse data sets from across the company (or even externally) to bring to bear all the features and examples that can help to improve predictions.
This leads to a software system built on top of data pulled from every corner of the business. Worse, it is rarely the case that the raw data is the signal that works best to model. Most often, significant cleanup and processing is required to get the data into a form that is most effective as input for the model.
The result is a tangle of data pipelines, spanning the company, often with very non-trivial processing logic, and a model that depends (in sensitive and non-obvious ways) on all these data feeds. This can be incredibly difficult to maintain with production SLAs. A group at Google has written a good summary of the issues with these data pipelines under the memorable title "Machine Learning: The High Interest Credit Card of Technical Debt." The majority of their difficulties center around these pipelines and the implicit dependencies and feedback loops they create.
It is critical that the data set the model is built off of and the data set that the model is eventually applied to are as close as possible. Any attempt to build a model off of one data set, say pulled from a data warehouse, in a lab setting, and then apply that to a production setting where the underlying data may be subtly different is likely to run into intractable difficulties due to this difference.
Our experience with the People You May Know algorithm at LinkedIn highlights this. The single biggest cause of fragility in the system was the data pipelines that fed the algorithm. These pipelines aggregated data across dozens of application domains, and, hence, were subject to unexpected upstream changes whenever new application changes were made, often unannounced to us and in very different areas of the company. The complexity of the pipelines themselves was a liability. There was one year where the single biggest relevance improvement made all year was not an improvement in the machine learning algorithm used or a new feature added to the model, but rather a fix for a bug in a particularly complex data pipeline, which allowed the resulting feature to have much better coverage and be used in more predictions. It turned out this bug had been in place since the very first version of the system, subtly degrading predictive accuracy, unnoticed the entire time.
What is to be done?
Monica Rogati, one of LinkedIn’s first data scientists, does an excellent job of summarizing the difficulties of building machine learning and AI systems in her recent article “The AI Hierarchy of Needs.” Just like Maslow characterizes human needs in a layered pyramid going from the most basic (food, clothing, and shelter) to the highest level (self-actualization), Monica describes a similar hierarchy for AI and machine learning projects. Her hierarchy looks like this:
Figure 1. Figure courtesy of Monica Rogati, used with permission.
The first few layers are among the biggest sources of struggle for companies first attempting machine learning projects. They lack a structured, scalable, reliable way of doing data collection, and as a result, the data intended for modeling is often incomplete, wrong, out-of-date, not available, or sprinkled across a dozen different systems.
The good news is that while the problem of integrating machine learning into business problems is very particular to the problem you are trying to solve, the collection of data is much more similar across companies. By attacking this common problem, we can start to address some of these challenges.
Apache Kafka as a universal data pipeline
Apache Kafka is a technology that came out of LinkedIn around the same time that the work I described was being done on data products. It was inspired by a number of challenges in using the data LinkedIn had, but one big motivation was the difficulty in building data-driven, machine learning-powered products and the complexity of all the data pipelines required to feed them.
This went through many iterations, but we came to an architecture that looked like this:
Figure 2. Figure courtesy of Jay Kreps.
The essence of this architecture is that it uses Kafka as an intermediary between the various data sources from which feature data is collected, the model-building environment where the model is fit, and the production application that serves predictions.
Feature data is pulled into Kafka from the various apps and databases that host it. This data is used to build models. The environment for this will vary based on the skills and preferred toolset of the team: the model building could happen in a data warehouse, a big data environment like Spark or Hadoop, or a simple server running Python scripts. The model can then be published so that the production app picks up the same model parameters and applies them to incoming examples (perhaps using Kafka Streams, an integrated stream processing layer, to help index the feature data for easy usage on demand).
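As a rough sketch of that last idea, using Kafka Streams to index a feature topic for on-demand lookup in the serving path (topic, store, and application names are invented; the store-query call shown is the newer API and differs slightly across Kafka Streams versions):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

import java.util.Properties;

public class FeatureLookup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "feature-lookup-app");  // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Materialize the feature topic into a locally queryable table keyed by entity id.
        builder.globalTable("user-features",
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("features-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // In a real application, wait until the instance is RUNNING before querying local state.
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("features-store",
                        QueryableStoreTypes.keyValueStore()));
        System.out.println("features for user-123: " + store.get("user-123"));
    }
}
```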
How does this architecture help to address the four challenges I described? Let’s go through each challenge in turn.
Kafka decouples model building and serving
The first challenge I described was the difficulty of building production-grade software that depends on complex analytics. The advantage of this architecture is that it helps to segregate the complex environment in which the model building takes place from the production application in which the model is applied.
Typically, the production application needs to be fairly hardened from day one, as it sits directly in the customer experience. However the model building is often done only periodically, say daily or weekly, and may even be only partially automated in initial versions. This architecture gives each of these areas a crisp contract to the other: they communicate through streams of data published to Kafka. This allows the model building to evolve through different toolsets, algorithms, etc., independent of the production application.
Importantly, Kafka’s pub/sub model allows forking the data stream so the data seen in the production application is exactly the same stream given to the model building environment.
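A minimal sketch of that forking with plain consumers: the serving application and the model-building pipeline subscribe to the same topic under different consumer groups, so each independently sees the complete stream (names are illustrative):

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ForkedConsumers {
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("feature-events"));
        return consumer;
    }

    public static void main(String[] args) {
        // Distinct group ids mean each consumer receives every record, not a share of them.
        KafkaConsumer<String, String> serving = consumerFor("serving-app");
        KafkaConsumer<String, String> modelBuilding = consumerFor("model-building");

        ConsumerRecords<String, String> records = serving.poll(Duration.ofSeconds(1));
        records.forEach(r -> System.out.println("serving saw: " + r.value()));
        // ...the model-building consumer polls the same topic on its own schedule...
        modelBuilding.poll(Duration.ofSeconds(1));
    }
}
```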
A concrete example of this decoupling is the evolution that often happens in the freshness of the model. Often, early versions of the model are built in a semi-manual manner and refreshed monthly or weekly. However, training off up-to-date data is important for accuracy, so the retraining generally becomes more frequent as the application proves its worth. The end state for some of these models is a system that reacts and retrains itself with much lower latency: either periodically throughout the day or, in particularly dynamic problem domains, continuously. The advantage of this architecture is that the model building area can go through this evolution independently of the application that takes advantage of the model.
Kafka helps integrate model builders and system builders
Segregating the environment for model building from the production application and forking off the same feed of data to both allows the model builders freedom to embrace a toolset and process that wouldn’t be appropriate for production application development. The different groups can use the tools most appropriate for their domain and still integrate around shared data feeds that are identical in each environment.
Because of the segregation, the model builders can’t break the production application. The worst outcome they can produce is to allow the deployment of a more poorly performing model. Even this can be remediated by having automated A/B tests or bandit algorithms that control the rollout of new model changes so that a bad model can only impact a small portion of the user base.
Kafka acts as the pipeline for instrumentation
The next challenge was developing the level of instrumentation and measurement sufficient to judge the performance of the production application and being able to use that detailed feedback data in model building.
The same Kafka pipeline that delivers feature data and model updates can be used to capture the stream of events that records what happens when a model is applied. Kafka is scalable enough to handle even very large data streams. For example, many of the machine learning applications at LinkedIn would record not only every decision they made, but also contextual information about the feature data that led to that decision and the alternative decisions that had lower scores. This vast firehose of data was captured and used for evaluating performance, A/B testing new models, and gathering data to retrain.
The ability to handle both core feature data about the input to the algorithms as well as the event data recording what happened over the same pipeline significantly simplifies the integration problem of using all this data.
Kafka helps tame diverse data dependencies
The unification of input data with the instrumentation I just described is a special case of the general problem of depending on a diverse set of upstream data sources across the business. Machine learning models need a vast array of input data—data from sensors, customer attributes from databases, event data captured from the user experience, not to mention instrumentation recording the results of applying the algorithm. Each of these likely needs significant processing before it is usable for model building or prediction. Each of these data types is different, but building custom integration for each leads to a mess of pipelines and dependencies.
Typical approaches for capturing feature data take an ETL-like approach, where a central team tries to scrape data from each source system with relevant attributes, munge these into a usable form, and then build on top of it.
Centralizing all these data feeds on Kafka gives a uniform layer that abstracts away details of the source systems and ensures the same data feeds go to both the model building environment and the production environment. This also gives a clear contract to source systems: publish your data in a well-specified format so it can be consumed in these environments. Frameworks like Kafka Connect can help integrate with many databases and off-the-shelf systems to make this integration easy.
This also allows you to evolve the freshness of the feature data used by the algorithm. Early versions of a feature may be scripted together in a one-off batch fashion and published out to test the effectiveness. If the feature proves predictive, it can be implemented as a continuous stream-processing transformation that keeps the data the algorithm uses for prediction always in sync with source systems. This can all be done without reworking the application itself.
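A sketch of what such a continuous feature transformation might look like in Kafka Streams, assuming a raw click topic keyed by user id (topic and application names are invented for the example):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class ClickCountFeature {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-count-feature"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> clicks = builder.stream("raw-click-events"); // placeholder topic

        // Maintain a per-user click count continuously and republish it as a feature stream,
        // so the value the model reads stays in sync with the source events.
        clicks.groupByKey()
              .count()
              .toStream()
              .mapValues(Object::toString)
              .to("user-click-count-feature"); // placeholder topic

        new KafkaStreams(builder.build(), props).start();
    }
}
```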
The machine learning app and the streaming platform
This architecture doesn’t solve all the problems in building machine learning applications—it still remains really hard. What it does do is help to provide a path to solidifying the bottom layer of the hierarchy of needs and solidifying your data pipelines, which, after all, are the “food/clothing/shelter” of the machine learning application. There are still many hard problems remaining in the layers above.
I’ve shown this architecture in the context of one application, and usually that is what you start with. Over time, though, this pattern can grow into a general purpose architecture that we call a streaming platform. Having a central platform prepopulated with streams of data obviously makes additional machine learning applications easier to build, but also enables asynchronous event-driven microservices and many other such use cases.
This is our driving..
http://ift.tt/2gk17Jw
0 notes
doorrepcal33169 · 7 years ago
Text
Apache Kafka and the four challenges of production machine learning systems
Untangling data pipelines with a streaming platform.
Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their product, processes or customer experience. The cartoon version of machine learning sounds quite easy: you feed in training data made up of examples of good and bad outcomes, and the computer automatically learns from these and spits out a model that can make similar predictions on new data not seen before. What could be easier, right?
Those with real experience building and deploying production systems built around machine learning know that, in fact, these systems are shockingly hard to build. For the most part, this difficulty does not lie in the algorithmic or mathematical complexity of machine learning algorithms. Creating such algorithms is difficult, to be sure, but the algorithm creation process is mostly done by academic researchers. Teams that use the algorithms in production systems almost always use off-the-shelf libraries of pre-built algorithms. Nor is the difficulty primarily in using the algorithms to generate a model, though learning to debug and improve machine learning models is a skill in its own right. Rather, the core difficulty is that building and deploying systems that use machine learning is very different from traditional software engineering, and as a result, it requires practices and architectures different from what teams with a traditional software engineering background are familiar with.
Most software engineers have a clear understanding of how to build production-ready software using traditional tools and practices. In essence, traditional software engineering is the practice of turning business logic explicitly specified by engineers or product managers into production software systems. We have developed many practices in how to test and validate this type of system. We know the portfolio of skills a team needs to design, implement, and operate them, as well as the methodologies that can work to evolve them over time. Building traditional software isn’t easy, but it is well understood.
Machine learning systems have some of the same requirements as traditional software systems. We want fast, reliable systems we can evolve over time. But a lot about the architecture, skills, and practices of building them is quite different.
My own experience with this came earlier in my career. I'd studied machine learning in school and done work on applying support vector machines, graphical models, and Bayesian statistical methods on geographical data sets. This was a good background in the theory of machine learning and statistics, but was surprisingly unhelpful about how one would build this kind of system in practice (probably because, at the time, very few people were). One set of experiences that drove this came from my time at LinkedIn, beginning in 2007. I had joined LinkedIn for the data: the opportunity to apply cutting-edge algorithms to social network data. I managed the team that, among other things, ran People You May Know and Similar Profiles products, both of which we wanted to move from ad hoc heuristics to machine learning models. Over time, LinkedIn built many such systems for problems as diverse as newsfeed relevance, advertising, standardizing profile data, and fighting fraud and abuse.
In doing this, we learned a lot about what was required to successfully build production machine learning systems, which I’ll describe in this post. But first, it’s worth getting specific about what makes it hard. Here are four particularly difficult challenges.
Challenge one: Machine learning systems use advanced analytical techniques in production software
The nature of machine learning and its core difference from traditional analytics is that it allows the automation of decision-making. Traditional reporting and analytics is generally an input for a human who would ultimately make and carry out the resulting decision manually. Machine learning is aimed at automatically producing an optimal decision.
The good news is that by taking the human out of the loop, many more decisions can be made: instead of one global decision (which is often all humans have time for), automated decision-making allows decisions to be made continuously, dynamically, and in a personalized way as part of every customer experience, on matters far too messy for traditional, manually specified business rules.
The challenge this presents, though, is that it generally demands directly integrating sophisticated prediction algorithms into production software systems. Human-driven analytics has plenty of failures, but the impact is more limited as there is a human in the loop to sanity-check the result (or at least take the blame if there is a problem!). Machine learning systems that go bad, however, can seriously harm the business before the problem is discovered and remediated.
Challenge two: Integrating model builders and system builders
How do you build a system that requires a mixture of advanced analytics as well as traditional software engineering? This is one of the biggest organizational challenges in building teams to create this type of system. It is quite hard to find engineers who have expertise in both traditional production software engineering and also machine learning or statistics. Yet, not only are both of these skills needed, they need to be integrated very closely in a production-quality software system.
Complicating this, the model building toolset is rapidly evolving and will often change as the model itself evolves. Early quick and dirty work may be little more than a simple R or Python script. If scalability is a challenge, Spark or other big data tools may be used. As the modeling graduates from simple regression models to more advanced techniques, algorithm-specific frameworks like Keras or TensorFlow may be employed.
Even more than the difference in tools and skills, the methodology and values are often different between model builders and production software engineers. Production software engineers rigorously design and test systems that have well-defined behaviors in all cases. They tend to focus on performance, maintainability, and correctness. Model builders have a different definition of excellence. They tend to value enough software engineering to get their job done and no more. Their job is inherently more experimental: rather than specifying the right behavior and then implementing it, model builders need to experimentally gather possible inputs and test their impact on predictive accuracy. This group naturally learns to avoid a heavy up-front engineering process for these experiments, as most will fail and need to be removed.
The speed with which the model building team can generate hypotheses about new data or algorithmic tweaks that might produce improvements in predictive accuracy, and validate or invalidate these hypotheses, is what determines the progress of the team at driving predictive accuracy.
A fundamental issue in the architecture of a machine learning-powered application is how these two groups can interact productively: the model builders need to be able to iterate in an agile fashion on predictive modeling and the production software engineers need to integrate these model changes in a stable, scalable production system. How can the model builders be kept safe and agile when their very model is at the heart of a system that is an integral part of an always-on production software system?
Challenge three: The failure of QA and the importance of instrumentation
Normal software has a well-defined notion of correctness that is mostly independent of the data it will run on. This lets new code be written, tested, and shipped from a development to production environment with reasonable confidence that it will be correct.
Machine learning systems are simply not well validated by traditional QA practices. A machine learning model takes input values and generally produces a number as output. Whether the system is working well or poorly, the output will still be a number. Worse, it is mostly impossible to specify correctness for a single input and the corresponding output, the core practice in traditional QA. The very notion of correctness for machine learning is statistical and defined with respect not just to the code that runs it, but to the input data on which it is run. Unit and integration tests, the backbone of modern approaches to quality assurance in traditional software engineering, are often quite useless in this domain.
The result is that a team building a machine learning application must lean very heavily on instrumentation of the running system to understand the behavior of their application in production, and they must have access to real data to build and test against, prior to production deployment of any model changes. What inputs did the application have? What prediction did it make? What did the ground truth turn out to be? Recording these facts and analyzing them both retrospectively and in real-time against the running system is the key to detecting production issues quickly and limiting their impact.
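To make that concrete, here is a minimal sketch of the kind of record such instrumentation might capture; the class, field names, and log format are illustrative assumptions rather than any particular system's schema. The request identifier is what lets the eventual ground truth be joined back to the prediction.

import java.time.Instant;
import java.util.Map;

// Hypothetical record of one model decision; field names are illustrative only.
public final class PredictionEvent {
    private final String requestId;               // lets the ground truth be joined back later
    private final Map<String, Double> features;   // the inputs the model actually saw
    private final double score;                   // the prediction that was served
    private final Instant timestamp = Instant.now();

    public PredictionEvent(String requestId, Map<String, Double> features, double score) {
        this.requestId = requestId;
        this.features = features;
        this.score = score;
    }

    // Renders the event as a single line that can be parsed offline or streamed into an event pipeline.
    public String toLogLine() {
        return String.format("%s\t%s\t%.6f\t%s", timestamp, requestId, score, features);
    }

    public static void main(String[] args) {
        PredictionEvent event = new PredictionEvent(
                "req-42", Map.of("connectionCount", 17.0, "profileSimilarity", 0.83), 0.91);
        System.out.println(event.toLogLine());
    }
}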
In many ways, this is an extension of a trend in software engineering that has become quite pronounced over the last 10 years. When I started my career, software was heavily QA'd and then shipped to production with very little ability to follow its behavior there. Modern software engineering still values testing (though hopefully with a bit more automation), but it also uses logging and metrics much more heavily to track the performance of systems in real time.
This discipline of "observability" is even more important in machine learning systems. If modern systems are piloted by a combination of "looking at the instruments" and "looking out the window," machine learning systems quickly must be flown almost entirely by the instruments. You cannot easily say if one bad result is an indication of something bad or just the expected failure rate of an inherently imperfect prediction algorithm.
The quality and detail of the instrumentation is particularly important because the output of the machine learning model often changes the behavior of the system being measured. A different model creates a different customer experience, and the only way to say if that is better or worse than it was before is to perform rigorous A/B tests that evaluate each change in the model and allow you to measure and analyze its performance across relevant segments.
Even this may not be enough. At LinkedIn, we measured improvements in performance only over a short time. One surprise we found was that there was a novelty effect: even adding random noise to the prediction could improve it in the short term because it produced novel results, but this would decay over time. This led to a series of improvements: tracking and using the novelty of the results that had been shown, feeding the prior exposures into the model, and changing our procedures to run experiments over a longer period of time.
Challenge four: Diverse data dependencies
Many of the fundamental difficulties in machine learning systems come from their hunger for data. Of course, all software is about data, but a traditional application needs only the data it uses to perform its business logic, or that which it intends to display on the screen. Machine learning models have a much more voracious hunger for data. The more data, the better! This can mean more examples to train your model with, but often even more useful is more features to include in the model. This necessitates a process of gathering diverse data sets from across the company (or even externally) to bring to bear all the features and examples that can help to improve predictions.
This leads to a software system built on top of data pulled from every corner of the business. Worse, it is rarely the case that the raw data is the signal that works best to model. Most often, significant cleanup and processing is required to get the data into a form that is most effective as input for the model.
The result is a tangle of data pipelines, spanning the company, often with very non-trivial processing logic, and a model that depends (in sensitive and non-obvious ways) on all these data feeds. This can be incredibly difficult to maintain with production SLAs. A group at Google has written a good summary of the issues with these data pipelines under the memorable title "Machine Learning: The High Interest Credit Card of Technical Debt." The majority of their difficulties center around these pipelines and the implicit dependencies and feedback loops they create.
It is critical that the data set the model is built off of and the data set that the model is eventually applied to are as close as possible. Any attempt to build a model off of one data set, say pulled from a data warehouse, in a lab setting, and then apply that to a production setting where the underlying data may be subtly different is likely to run into intractable difficulties due to this difference.
Our experience with the People You May Know algorithm at LinkedIn highlights this. The single biggest cause of fragility in the system was the data pipelines that fed the algorithm. These pipelines aggregated data across dozens of application domains, and, hence, were subject to unexpected upstream changes whenever new application changes were made, often unannounced to us and in very different areas of the company. The complexity of the pipelines themselves was a liability. One year, the single biggest relevance improvement was not an improvement in the machine learning algorithm used or a new feature added to the model, but rather fixing a bug in a particularly complex data pipeline that allowed the resulting feature to have much better coverage and be used in more predictions. It turned out this bug had been in place since the very first version of the system, subtly degrading predictive accuracy, unnoticed the entire time.
What is to be done?
Monica Rogati, one of LinkedIn’s first data scientists, does an excellent job of summarizing the difficulties of building machine learning and AI systems in her recent article “The AI Hierarchy of Needs.” Just like Maslow characterizes human needs in a layered pyramid going from the most basic (food, clothing, and shelter) to the highest level (self-actualization), Monica describes a similar hierarchy for AI and machine learning projects. Her hierarchy looks like this:
Figure 1. Figure courtesy of Monica Rogati, used with permission.
The first few layers are among the biggest sources of struggle for companies first attempting machine learning projects. They lack a structured, scalable, reliable way of doing data collection, and as a result, the data intended for modeling is often incomplete, wrong, out-of-date, not available, or sprinkled across a dozen different systems.
The good news is that while the problem of integrating machine learning into business problems is very particular to the problem you are trying to solve, the collection of data is much more similar across companies. By attacking this common problem, we can start to address some of these challenges.
Apache Kafka as a universal data pipeline
Apache Kafka is a technology that came out of LinkedIn around the same time that the work I described was being done on data products. It was inspired by a number of challenges in using the data LinkedIn had, but one big motivation was the difficulty in building data-driven, machine learning-powered products and the complexity of all the data pipelines required to feed them.
This went through many iterations, but we came to an architecture that looked like this:
Figure 2. Figure courtesy of Jay Kreps.
The essence of this architecture is that it uses Kafka as an intermediary between the various data sources from which feature data is collected, the model-building environment where the model is fit, and the production application that serves predictions.
Feature data is pulled into Kafka from the various apps and databases that host it. This data is used to build models. The environment for this will vary based on the skills and preferred toolset of the team. The model building environment could be a data warehouse, a big data environment like Spark or Hadoop, or a simple server running Python scripts. The resulting model can then be published so the production app can pick up the same model parameters and apply them to incoming examples (perhaps using Kafka Streams, an integrated stream processing layer, to help index the feature data for easy usage on demand).
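As a rough sketch of that last step, the snippet below uses Kafka Streams to materialize a topic of per-user feature data into a local store that the serving application can query on demand. The topic name, store name, and string-encoded values are assumptions made for illustration.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

import java.util.Properties;

public class FeatureStoreApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "feature-indexer");     // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Treat "user-features" as a changelog and keep the latest value per user key
        // in a queryable local store the serving code can hit at prediction time.
        builder.table("user-features",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("features-by-user"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}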
How does this architecture help to address the four challenges I described? Let’s go through each challenge in turn.
Kafka decouples model building and serving
The first challenge I described was the difficulty of building production-grade software that depends on complex analytics. The advantage of this architecture is that it helps to segregate the complex environment in which the model building takes place from the production application in which the model is applied.
Typically, the production application needs to be fairly hardened from day one, as it sits directly in the customer experience. However, the model building is often done only periodically, say daily or weekly, and may even be only partially automated in initial versions. This architecture gives each of these areas a crisp contract with the other: they communicate through streams of data published to Kafka. This allows the model building to evolve through different toolsets, algorithms, etc., independent of the production application.
Importantly, Kafka’s pub/sub model allows forking the data stream so the data seen in the production application is exactly the same stream given to the model building environment.
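A minimal sketch of that fork, assuming a shared "feature-events" topic: the production application and the model-building jobs differ only in the consumer group id they use, so each group receives its own complete copy of the same stream.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class StreamFork {
    // groupId is "serving-app" in production and "model-building" in the training environment.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("feature-events"));   // placeholder topic name
        return consumer;
    }

    public static void main(String[] args) {
        try (KafkaConsumer<String, String> consumer = consumerFor("model-building")) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}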
A concrete example of this decoupling is the evolution that often happens in the freshness of the model. Often, early versions of the model are built in a semi-manual manner and refreshed monthly or weekly. However, training off up-to-date data is important for accuracy, so the retraining generally becomes more frequent as the application proves its worth. The end state for some of these models is a system that reacts and retrains itself with much lower latency: either periodically throughout the day or, in particularly dynamic problem domains, continuously. The advantage of this architecture is that the model building area can go through this evolution independently of the application that takes advantage of the model.
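One hedged sketch of what the serving side of that evolution can look like: a background thread subscribes to a "model-updates" topic (an assumed name, typically a compacted topic) and atomically swaps in whatever parameters the model-building environment last published, so retraining frequency can change without touching the application.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicReference;

public class ModelRefresher implements Runnable {
    // The live model parameters (serialized form); request threads read this reference per prediction.
    private final AtomicReference<String> currentModel = new AtomicReference<>("initial-model");

    @Override
    public void run() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "serving-app-model-refresh");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("model-updates"));   // assumed topic name
            while (!Thread.currentThread().isInterrupted()) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                    currentModel.set(record.value());   // swap in the newly published parameters
                    System.out.println("Loaded model version " + record.key());
                }
            }
        }
    }

    public String modelForServing() {
        return currentModel.get();
    }
}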
Kafka helps integrate model builders and system builders
Segregating the environment for model building from the production application and forking off the same feed of data to both allows the model builders freedom to embrace a toolset and process that wouldn’t be appropriate for production application development. The different groups can use the tools most appropriate for their domain and still integrate around shared data feeds that are identical in each environment.
Because of the segregation, the model builders can’t break the production application. The worst outcome they can produce is to allow the deployment of a more poorly performing model. Even this can be remediated by having automated A/B tests or bandit algorithms that control the rollout of new model changes so that a bad model can only impact a small portion of the user base.
Kafka acts as the pipeline for instrumentation
The next challenge was developing the level of instrumentation and measurement sufficient to judge the performance of the production application and being able to use that detailed feedback data in model building.
The same Kafka pipeline that delivers feature data and model updates can be used to capture the stream of events that records what happens when a model is applied. Kafka is scalable enough to handle even very large data streams. For example, many of the machine learning applications in LinkedIn would record not only every decision they made, but also contextual information about the feature data that led to that decision and the alternative decisions that had lower scores. This vast firehose of data was captured and used to evaluate performance, A/B test new models, and gather data for retraining.
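A sketch of what publishing one such decision event might look like; the topic name and the JSON payload shape are assumptions for illustration, not LinkedIn's actual schema.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class DecisionEventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // The decision that was served, the feature context behind it, and the runners-up.
        String event = "{"
                + "\"requestId\":\"req-42\","
                + "\"chosen\":{\"candidate\":\"member-123\",\"score\":0.91},"
                + "\"alternatives\":[{\"candidate\":\"member-456\",\"score\":0.64}],"
                + "\"features\":{\"connectionCount\":17,\"profileSimilarity\":0.83}"
                + "}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("decision-events", "req-42", event));
        }
    }
}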
The ability to handle both core feature data about the input to the algorithms as well as the event data recording what happened over the same pipeline significantly simplifies the integration problem of using all this data.
Kafka helps tame diverse data dependencies
The unification of input data with the instrumentation I just described is a special case of the general problem of depending on a diverse set of upstream data sources across the business. Machine learning models need a vast array of input data—data from sensors, customer attributes from databases, event data captured from the user experience, not to mention instrumentation recording the results of applying the algorithm. Each of these likely needs significant processing before it is usable for model building or prediction. Each of these data types is different, but building custom integration for each leads to a mess of pipelines and dependencies.
Typical approaches for capturing feature data take an ETL-like approach, where a central team tries to scrape data from each source system with relevant attributes, munge these into a usable form, and then build on top of it.
Centralizing all these data feeds on Kafka gives a uniform layer that abstracts away details of the source systems and ensures the same data feeds go to both the model building environment and the production environment. This also gives a clear contract to source systems: publish your data in a well-specified format so it can be consumed in these environments. Frameworks like Kafka Connect can help integrate with many databases and off-the-shelf systems to make this integration easy.
This also allows you to evolve the freshness of the feature data used by the algorithm. Early versions of a feature may be scripted together in a one-off batch fashion and published out to test the effectiveness. If the feature proves predictive, it can be implemented as a continuous stream-processing transformation that keeps the data the algorithm uses for prediction always in sync with source systems. This can all be done without reworking the application itself.
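For example, a feature like "recent clicks per user" might start life as a nightly script and later be promoted to a continuous Kafka Streams transformation along these lines (the topic names are illustrative):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class ClickCountFeature {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-count-feature");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("clickstream", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               .count()                                      // running click count per user key
               .toStream()
               .mapValues(count -> Long.toString(count))
               .to("feature-click-counts", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}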
The machine learning app and the streaming platform
This architecture doesn’t solve all the problems in building machine learning applications—it still remains really hard. What it does do is help to provide a path to solidifying the bottom layer of the hierarchy of needs and solidifying your data pipelines, which, after all, are the “food/clothing/shelter” of the machine learning application. There are still many hard problems remaining in the layers above.
I’ve shown this architecture in the context of one application, and usually that is what you start with. Over time, though, this pattern can grow into a general purpose architecture that we call a streaming platform. Having a central platform prepopulated with streams of data obviously makes additional machine learning applications easier to build, but also enables asynchronous event-driven microservices and many other such use cases.
This is our driving..
from FEED 10 TECHNOLOGY http://ift.tt/2gk17Jw
0 notes
cstesttaken · 8 years ago
Text
Magic Leap’s new marketing boss has a tough challenge: Trying to sell an amazing product you can’t see
How do you market a company that says it has a mind-blowing, world-changing product — that it can’t show you?
That’s the challenge facing Brenda Freeman, the chief marketing officer for Magic Leap, the much-hyped “mixed reality” startup that has raised $1.4 billion from Google, Alibaba and other investors.
Freeman started at Magic Leap this fall, replacing Samsung veteran Brian Wallace, who the company says was “terminated without cause” in September; Wallace is now working at a startup run by former Android boss Andy Rubin.
In the midst of the shuffle, a widely read story from The Information reported that Magic Leap was struggling to turn its technology — which lets users import animated characters into their field of vision — into a consumer product, in the form of glasses or goggles. That report has heightened industry skepticism that Magic Leap can deliver on sky-high claims and ambitions.
I talked to her this week about the challenges she’s facing. Here’s an edited transcript of our conversation:
Peter Kafka: It seems like being a chief marketing officer at a company that has a lot of attention focused on it, but doesn’t have a product it can show off or really talk about in much detail, is a real challenge.
Brenda Freeman: I don’t think it’s a challenge as much as it forces the marketing strategy to perhaps pivot, until you actually have a product to experience. There’s marketing to the promise of what it is, and the fact that those who actually have experienced it, basically are amazed by it. But what we do in terms of creating early awareness and interest and intrigue is based on the promise of what it can do.
So the efforts in the beginning are more about educating that audience that we think is going to be actually interested in buying the product.
You mentioned a pivot. What are you pivoting from? It seems like you're talking about what Magic Leap has been doing for more than a year: showing the product to a relatively small group of people, who say it's amazing but can't go into details because they've signed NDAs.
I’d say the team has got a great start. The first thing you have to do is establish the brand. And I think the team did a really great job of creating a brand voice that’s unique in the marketplace. [We need to] make sure that we’re talking to the right audience, with a tone that’s befitting of the brand.
And as you know, at Magic Leap, we’re very much about the fact that it’s not hardware-first, it’s about using technology and what it can do for your life. So I think the idea of having a very humanistic approach to the overall marketing message has been actually very good.
What’s the most concise way to explain what Magic Leap is, to someone who hasn’t seen it?
It’s technology that is basically going to allow you to enhance your life. What we’re trying to do is — we’re on a mission. We want to create the best mixed-reality light field experience for the world. That’s how I would describe it.
When are consumers going to be able to touch this stuff?
As you can imagine, we feel really good about the fact that we’re on track. Our investors are very happy about the timelines that we’re working against. We feel really proud of that. We can’t actually say, exactly — we can’t share that yet publicly, but just know that we’re very much on time, and we’re on track.
But is it a year out? Two years out? Five years out?
We are racing toward launch. That’s why I was brought on board. My background is very much in the entertainment and content space and being able to drive marketing in an eventful sort of way — that’s why I was brought onto the team. We are racing toward launch. And we’re very much on track.
Is it frustrating that because this company has raised so much money, and because the initial descriptions of the product are so evocative, that it’s difficult for you to do your job, because you can’t let people see the product? It seems like you’re setting yourself up, because there could be a gap between what people are actually going to use and the expectations around that. How do you manage that?
I don’t believe that to be the case at all. I’ve actually experienced the product. That’s one of the reasons I decided to join. Because it’s all about amazing technology, but it’s also about the amazing content that’s going to be brought to life with this great technology.
I’m a left-brain, right-brain type of marketer. I was actually a chemical engineer; I actually designed rocket motors in my early career [at Atlantic Research Corp.] before I went into marketing. That’s why this is an amazing place to work, because it’s basically the best of both worlds.
The Information’s story talked about the gap between what you’re talking about and what you’re actually developing. I’ve heard similar things. Is that a fair description of where you guys are at?
I'm so glad that you asked that question. We feel like that narrative that's been created is just completely untrue. I think there's been a lot of conversation about a video that was created. [Magic Leap's] technology is optimized for the eye-brain system. And so it took a little bit of time to find the right technique to capture what you experience through the system — to make that translate to video. So we released a concept video, which is very representative of what we can do. It's nothing less than what all of our competition does as well.
[Embedded YouTube video: Magic Leap's concept video]
But there’s also a sense the company’s ability to make a consumer product isn’t as far along as it needs to be. I talked to one of Magic Leap’s investors recently, and they said that [CEO Rony Abovitz] “has no focus.” They described a company that has really cool technology and is struggling to productize it.
I would say that’s completely untrue. The good news is we have a founder who’s a visionary, and he’s a creator. But he also is a left-brain, right-brain brilliant person. And he’s got the technical chops, and he hires the best of the best, in terms of building our hardware and our software systems.
So we are absolutely on track, our investors who come down on a regular basis have experienced our product, we’ve walked them through the timeline. Our timelines have not changed one iota. We are racing toward launch and we’re meeting our goals.
You’re replacing Brian Wallace. What are you doing differently than he did?
Absolutely. My point of view, in terms of how I market, is probably very different than Brian’s. I’m not a hardware-first type of marketer. I’m very much an emotive type of storyteller. It’s using the technology, and [explaining] how the technology is going to enhance my life. So it’s about bringing it to life in a very interesting, never-been-done-before type of way.
So it’s a very high bar. Never been done before.
Anything else we should know?
No, other than the fact that it’s an amazing team. And, quite frankly, sometimes there’s changes that have to be made. You may have heard about the fact that there were some changes on the team quite recently. And with new leadership, change is inevitable. But we’re very much about strengthening our team, and making sure that we have a culture of those that are entrepreneurial and scrappy.
Source
http://www.recode.net/2016/12/22/14058314/magic-leap-marketing-brenda-freeman
0 notes
androidtechtk-blog · 8 years ago
Text
RX JAVA 2.0 releases support for Android
The RxJava team has released version 2.0 of their reactive Java framework, after an 18 month development cycle. RxJava is part of the ReactiveX family of libraries and frameworks, which is in their words, "a combination of the best ideas from the Observer pattern, the Iterator pattern, and functional programming". The project's "What's different in 2.0" is a good guide for developers already familiar with RxJava 1.x.
RxJava 2.0 is a brand new implementation of RxJava. This release is based on the Reactive Streams specification, an initiative for providing a standard for asynchronous stream processing with non-blocking back pressure, targeting runtime environments (JVM and JavaScript) as well as network protocols.
Reactive implementations have concepts of publishers and subscribers, as well as ways to subscribe to data streams, get the next stream of data, handle errors and close the connection.
The Reactive Streams spec will be included in JDK 9 as java.util.concurrent.Flow. The following interfaces correspond to the Reactive Streams spec. As you can see, the spec is small, consisting of just four interfaces:
·         Flow.Processor<T,R>: A component that acts as both a Subscriber and Publisher.
·         Flow.Publisher<T>: A producer of items (and related control messages) received by Subscribers.
·         Flow.Subscriber<T>: A receiver of messages.
·         Flow.Subscription: Message control linking a Flow.Publisher and Flow.Subscriber.
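For readers who want to see the JDK 9 shape of this, here is a minimal illustrative subscriber written against java.util.concurrent.Flow, using the JDK's own SubmissionPublisher as the producer:

import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

public class FlowDemo {
    public static void main(String[] args) throws InterruptedException {
        try (SubmissionPublisher<String> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(new Flow.Subscriber<String>() {
                private Flow.Subscription subscription;

                @Override public void onSubscribe(Flow.Subscription subscription) {
                    this.subscription = subscription;
                    subscription.request(1);              // explicit backpressure: ask for one item
                }
                @Override public void onNext(String item) {
                    System.out.println("received: " + item);
                    subscription.request(1);              // ask for the next item
                }
                @Override public void onError(Throwable throwable) { throwable.printStackTrace(); }
                @Override public void onComplete() { System.out.println("done"); }
            });
            publisher.submit("hello");
            publisher.submit("world");
        }
        Thread.sleep(500);   // give the asynchronous delivery a moment before the JVM exits
    }
}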
Spring Framework 5 is also going reactive. To see how this looks, refer to Josh Long's Functional Reactive Endpoints with Spring Framework 5.0.
To learn more about RxJava 2.0, InfoQ interviewed main RxJava 2.0 contributor, David Karnok.
InfoQ: First of all, congrats on RxJava 2.0! 18 months in the making, that's quite a feat. What are you most proud of in this release?
David Karnok: Thanks! In some way, I wish it didn't take so long. There was a 10-month pause when Ben Christensen, the original author who started RxJava, left and there was no one at Netflix to push this forward. I'm sure many will attest that things got way better when I took over the lead role this June. I'm proud my research into more advanced and more performant Reactive Streams paid off and RxJava 2 is the proof all of it works.
InfoQ: What’s different in RxJava 2.0 and how does it help developers?
Karnok: There are a lot of differences between version 1 and 2 and it's impossible to list them all here, but you can visit the dedicated wiki page for a comprehensible explanation. In headlines, we now support the de-facto standard Reactive Streams specification, have significant overhead reduction in many places, no longer allow nulls, have split types into two groups based on support of or lack of backpressure and have explicit interop between the base reactive types.
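To make the type split concrete, a small sketch against the RxJava 2 API: Observable carries no backpressure, Flowable implements the Reactive Streams Publisher and lets the caller choose an overflow strategy, and explicit conversion operators bridge the two.

import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;
import io.reactivex.Observable;

public class TypeSplitDemo {
    public static void main(String[] args) {
        // Observable: no backpressure; suited to UI events and modest element counts.
        Observable.just("click", "scroll", "tap")
                  .map(String::toUpperCase)
                  .subscribe(System.out::println);

        // Flowable: the Reactive Streams type; the caller states how overload should be handled.
        Flowable.range(1, 1_000_000)
                .onBackpressureDrop()            // drop items a slow consumer cannot keep up with
                .map(i -> i * 2)
                .subscribe(i -> { /* consume */ });

        // Explicit interop between the base reactive types.
        Flowable<String> asFlowable = Observable.just("a", "b").toFlowable(BackpressureStrategy.BUFFER);
        Observable<String> backAgain = asFlowable.toObservable();
        backAgain.subscribe(System.out::println);
    }
}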
InfoQ: Where do you see RxJava used the most (e.g. IoT, real-time data processing, etc.)?
Karnok: RxJava is more dominantly used by the Android community, based on the feedback I saw. I believe the server side is dominated by Project Reactor and Akka at the moment. I haven't specifically seen IoT mentioned or use RxJava (it requires Java), but maybe they use some other reactive library available on their platform. For real-time data processing people still tend to use other solutions, most of them not really reactive, and I'm not aware of any providers (maybe Pivotal) who are pushing for reactive in this area.
InfoQ: What benefits does RxJava provide Android more than other environments that would explain the increased traction?
Karnok: As far as I see, Android wanted to "help" their users solve async and concurrent problems with Android-only tools such as AsyncTask, Looper/Handler, etc.
Unfortunately, their design and use is quite inconvenient, often hard to understand or predict and generally brings frustration to Android developers. These can largely contribute to callback hell and the difficulty of keeping async operations off the main thread.
RxJava's design (inherited from the ReactiveX design of Microsoft) is dataflow-oriented and orthogonalized, where actions execute when data appears for processing. In addition, error handling and reporting is a key part of the flows. With AsyncTask, you had to manually work out the error delivery pattern and cancel pending tasks, whereas RxJava does that as part of its contract.
In practical terms, having a flow that queries several services in the background and then presents the results in the main thread can be expressed in a few lines with RxJava (+Retrofit) and a simple screen rotation will cancel the service calls promptly.
This is a huge productivity win for Android developers, and the simplicity helps them climb the steep learning curve the whole reactive programming's paradigm shift requires. Maybe at the end, RxJava is so attractive to Android because it reduces the "time-to-market" for individual developers, startups and small companies in the mobile app business.
There was nothing of a comparable issue on the desktop/server side of Java, in my opinion, at that time. People learned to fire up ExecutorServices and wait on Future.get(), knew about SwingUtilities.invokeLater to send data back to the GUI thread, and otherwise the Servlet API, which (pre 3.0) was one thread per request, naturally favored blocking APIs (database, service calls).
Desktop/server folks are more interested in the performance benefits a non-blocking design of their services offers (rather than how easily one can write a service). However, unlike Android development, having just RxJava is not enough and many expect/need complete frameworks to assist their business logic as there is no "proper" standard for non-blocking web services to replace the Servlet API. (Yes, there is Spring (~Boot) and Play but they feel a bit bandwagon-y to me at the moment).
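As a rough sketch of the flow Karnok describes (querying services in the background and presenting the result on the main thread), assuming a Retrofit-style service interface that returns reactive types and the RxAndroid main-thread scheduler:

import io.reactivex.Single;
import io.reactivex.android.schedulers.AndroidSchedulers;
import io.reactivex.disposables.Disposable;
import io.reactivex.schedulers.Schedulers;

public class ProfileLoader {
    // Hypothetical Retrofit-style service; in practice Retrofit would generate the implementation.
    interface ProfileService {
        Single<String> profile(String userId);
        Single<String> connections(String userId);
    }

    Disposable load(ProfileService service, String userId) {
        return Single.zip(
                    service.profile(userId),
                    service.connections(userId),
                    (profile, connections) -> profile + " / " + connections)
                .subscribeOn(Schedulers.io())                   // do the network work off the main thread
                .observeOn(AndroidSchedulers.mainThread())      // deliver the combined result to the UI
                .subscribe(
                    result -> System.out.println("render: " + result),
                    error -> System.out.println("show error: " + error));
        // Disposing the returned Disposable (e.g., on screen rotation) cancels both calls promptly.
    }
}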
InfoQ: HTTP is a synchronous protocol and can cause a lot of back pressure when using microservices. Streaming platforms like Akka and Apache Kafka help to solve this. Does RxJava 2.0 do anything to allow automatic back pressure?
Karnok: RxJava 2's Flowable type implements the Reactive Streams interface/specification and does support backpressure. However, the Java-level backpressure is quite different from the network-level backpressure. For us, backpressure means how many objects to deliver through the pipeline between different stages, where these objects can be non-uniform in type and size. On the network level, one usually deals with fixed-size packets, and backpressure manifests via the lack of acknowledgement of previously sent packets. In a classical setup, the network backpressure manifests on the Java level as blocking calls that don't return until all pieces of data have been written. There are non-blocking setups, such as Netty, where the blocking is replaced by implicit buffering, and as far as I know there are only individual, non-uniform and non-Reactive-Streams-compatible ways of handling those (i.e., a check for canWrite which one has to spin over/retry periodically). There exist libraries that try to bridge the two worlds (RxNetty, some Spring) with varying degrees of success as I see it.
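To illustrate what "how many objects to deliver through the pipeline" means at the Java level, a small RxJava 2 sketch in which the subscriber explicitly requests items in batches of three:

import io.reactivex.Flowable;
import io.reactivex.subscribers.DefaultSubscriber;

public class RequestBatchingDemo {
    public static void main(String[] args) {
        Flowable.range(1, 10)
                .subscribe(new DefaultSubscriber<Integer>() {
                    @Override protected void onStart() {
                        request(3);                     // initial demand: three items
                    }
                    @Override public void onNext(Integer item) {
                        System.out.println("got " + item);
                        if (item % 3 == 0) {
                            request(3);                 // ask for the next batch once this one is handled
                        }
                    }
                    @Override public void onError(Throwable t) { t.printStackTrace(); }
                    @Override public void onComplete() { System.out.println("complete"); }
                });
    }
}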
InfoQ: Do you think reactive frameworks are necessary to handle large amounts of traffic and real-time data?
Karnok: It depends on the problem complexity. For example, if your task is to count the number of characters in a big-data source, there are faster and more dedicated ways of doing that. If your task is to compose results from services on top of a stream of incoming data in order to return something detailed, reactive-based solutions are quite adequate. As far as I know, most libraries and frameworks around Reactive Streams were designed for throughput and not really for latency. For example, in high-frequency trading, the predictable latency is very important and can be easily met by Aeron but not the main focus for RxJava due to the unpredictable latency behavior.
InfoQ: Does HTTP/2 help solve the scalability issues that HTTP/1.1 has?
Karnok: This is not related to RxJava and I personally haven't played with HTTP/2 but only read the spec. Multiplexing over the same channel is certainly a win in addition to the support for explicit backpressure (i.e., even if the network can deliver, the client may still be unable to process the data in that volume) per stream. I don't know all the technical details but I believe Spring Reactive Web does support HTTP/2 transport if available but they hide all the complexity behind reactive abstractions so you can express your processing pipeline in RxJava 2 and Reactor 3 terms if you wish.
InfoQ: Java 9 is projected to be featuring some reactive functionality. Is that a complete spec?
Karnok: No. Java 9 will only feature 4 Java interfaces with 7 methods total. No stream-like or Rx-like API on top of that nor any JDK features built on that.
InfoQ: Will that obviate the need for RxJava if it is built right into the JDK?
Karnok: No and I believe there's going to be more need for a true and proven library such as RxJava. Once the toolchains grow up to Java 9, we will certainly provide adapters and we may convert (or rewrite) RxJava 3 on top of Java 9's features (VarHandles).
One of my fears is that once Java 9 is out, many will try to write their own libraries (individuals, companies) and the "market" will get diluted with big-brand, low-quality solutions, not to mention the increased amount of "how does RxJava differ from X" questions.
My personal opinion is that this is already happening today around Reactive Streams where certain libraries and frameworks advertise themselves as RS but fail to deliver based on it. My (likely biased) conjecture is that RxJava 2 is the closest library/technology to an optimal Reactive Streams-based solution that can be.
InfoQ: What's next for RxJava?
Karnok: We had fantastic reviewers, such as Jake Wharton, during the development of RxJava 2. Unfortunately, the developer previews and release candidates didn't generate enough attention and despite our efforts, small problems and oversights slipped into the final release. I don't expect major issues in the coming months but we will keep fixing both version 1 and 2 as well as occasionally adding new operators to support our user base. A few companion libraries, such as RxAndroid, now provide RxJava 2-compatible versions, but the majority of the other libraries don't yet or haven't yet decided how to go forward. In terms of RxJava, I plan to retire RxJava 1 within six months (i.e., only bugfixes from then on), partly due to the increasing maintenance burden on my "one man army" for some time now; partly to "encourage" the others to switch to the RxJava 2 ecosystem. As for RxJava 3, I don't have any concrete plans yet. There are ongoing discussions about splitting the library along types or along backpressure support as well as making the so-called operator-fusion elements (which give a significant boost to our performance) a standard extension of the Reactive Streams specification.
0 notes
netmetic · 5 years ago
Text
The Scalability Downside of Static Topics; Learning from LinkedIn’s Implementation of Apache Kafka
Kafka was born as a stream processing project of LinkedIn’s engineering team, so a recent blog post by two current members of that team titled, How LinkedIn customizes Apache Kafka for 7 trillion messages per day, caught my attention. In it, the authors give a detailed look at what it takes to keep a large-scale Kafka system running, and they talk about some steps they’ve taken to address scalability and operability challenges.
In short:
They created a tool called Brooklin to improve upon MirrorMaker, Kafka’s default cross-site replication tool.
They created a monitoring and allocation tool called Cruise Control to make it easier to deal with daily broker failures.
Finally, and ironically, they’ve written their own branch of Kafka.
The originators of Kafka have found it necessary to customize and create additional tooling to achieve the scalability they need in today’s technology landscape, which has changed a lot since its inception. The reality is that there are other ways to optimize your system for unlimited scalability and easy operability.
Losing Balance – A Cascading Failure
Before I get to that, let’s take a look at Cruise Control, which on the surface seems like a cool tool that niftily rebalances topics for you. There are two aspects to rebalancing topics:
Consumer Scaling: Adding consumers to a consumer group to enable load balancing requires the consumer group to have the topic's partitions rebalanced so the load is distributed across the consumers. This part of the Kafka architecture keeps the broker very simple, but it puts the emphasis on the client to maintain state and negotiate with other consumers to determine which partitions to consume from (see the sketch after this list).
High Availability: For high availability, should a broker fail, the broker load must be rebalanced to avoid “hot spots” of load throughout the cluster. If the rebalancing is done poorly, the failure of one broker causes an asymmetric distribution of load to the other brokers. This can cause one of the remaining brokers to become overloaded and fail, leading to a cascading failure. This is where Cruise Control comes in; I suspect LinkedIn found that with so many brokers, they were spending too much time rebalancing topics.
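A minimal sketch of the client side of that contract (the topic and group names are placeholders): every instance started with the same group.id joins the group, and each join or failure triggers a rebalance that redistributes the topic's partitions across the surviving members.

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

public class ClickstreamWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "clickstream-processors");   // every instance shares this id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("clickstream"), new ConsumerRebalanceListener() {
            @Override public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("rebalance: giving up " + partitions);   // another instance joined or left
            }
            @Override public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("rebalance: now owning " + partitions);
            }
        });
        while (true) {
            consumer.poll(Duration.ofSeconds(1));   // polling keeps this member alive in the group
        }
    }
}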
To summarize:
The Kafka architecture simplifies the broker to increase performance and eliminate state on the broker;
This forces application load-balancing logic onto the client, and the client manages state; and,
Replica clustering with a static topic structure requires constant monitoring and maintenance of the cluster to ensure load is evenly balanced.
Alternative Architectural Approaches to Consider
A smarter broker maintains state (such as “last event read”), which simplifies the overall architecture by allowing for stateless clients (desirable for those building microservices). The trade-off is a more complex broker, which must maintain state.
How does this affect scaling? Well, the broker must replicate state. Kafka is already replicating state (the topic), so that isn’t a problem. Obviously, the clients need to connect to wherever this state is held.
A smarter broker deals with client load balancing, which enables automated, dynamic application load balancing without any interruptions, so there's no more waiting for the system to quiesce.
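As a hedged illustration of what "state on the broker" buys the client, here is a minimal JMS-style sketch of a durable subscriber. The JNDI names and topic are placeholders rather than any particular product's configuration; the point is that the broker remembers the subscriber's position, so the client code carries no offsets or partition assignments at all.

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.Topic;
import javax.naming.InitialContext;

public class StatelessSubscriber {
    public static void main(String[] args) throws Exception {
        // Assumes the ConnectionFactory and Topic are registered in JNDI by the broker's
        // tooling; the names below are placeholders for illustration only.
        InitialContext jndi = new InitialContext();
        ConnectionFactory factory = (ConnectionFactory) jndi.lookup("jms/ConnectionFactory");
        Topic topic = (Topic) jndi.lookup("jms/topics/clickstream");

        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);

        // A durable subscription: the broker tracks this client's position ("last event read"),
        // so the client itself keeps no state between runs.
        MessageConsumer consumer = session.createDurableSubscriber(topic, "clickstream-analytics");
        connection.start();

        while (true) {
            Message message = consumer.receive();
            // process the message, then tell the broker it can advance our position
            message.acknowledge();
        }
    }
}
```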
Replication clustering spreads data replication across the brokers in a cluster, which improves flexibility: as long as a pool of brokers is available to absorb the load, replicas can be placed wherever there is capacity.
This does, however, introduce the need to constantly re-distribute the load (hence Cruise Control), and makes monitoring and management tricky. How do I know how many extra brokers I will need? What is the granularity of my load? How do I ensure my replica is not in the same availability pool?
It’s better to take a more static approach: replicating to a known hot spare. Note that this does not preclude clustering; the hot spare is simply for data replication, not broker load balancing. This approach means that state-sharing isn’t a problem since the clients simply re-connect to the hot spare. Unfortunately, you need to provision twice as many brokers as you otherwise would.
The Secret to Scaling: Number of Brokers or Number of Messages?
One of the cited advantages of Kafka is its scalability. For instance, in this post on DZone, Matija Gobec describes Kafka’s scalability as “one of the key features” and notes that “this is where Kafka excels.” But as I discussed earlier, even the creators of Kafka found it necessary to create their own tools to address scalability.
Let's take a step back and think about the title of their post from an architectural standpoint. 7 trillion events a day? That's 81M events a second. Spread across Twitter's 325M monthly active users, each would have to tweet every 4 seconds, every minute, every day. LinkedIn has 645M members, which means that over my 8-hour working day I would be responsible for an event every two to three seconds.
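A quick back-of-envelope check of those figures, as a throwaway sketch that uses only the numbers quoted above:

```java
public class BackOfEnvelope {
    public static void main(String[] args) {
        long eventsPerDay = 7_000_000_000_000L;                        // 7 trillion, per the post title
        long eventsPerSecond = eventsPerDay / (24 * 60 * 60);          // ~81 million per second
        double secondsBetweenTweets = 325_000_000.0 / eventsPerSecond; // ~4 s per Twitter user
        double eventsPerMemberPerDay = (double) eventsPerDay / 645_000_000L;      // ~10,900 per LinkedIn member
        double perMemberPerWorkingSecond = eventsPerMemberPerDay / (8 * 60 * 60); // ~0.4 per second

        System.out.printf("events per second: %,d%n", eventsPerSecond);
        System.out.printf("seconds between tweets per Twitter user: %.1f%n", secondsBetweenTweets);
        System.out.printf("events per LinkedIn member per working second: %.2f%n", perMemberPerWorkingSecond);
    }
}
```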
Does this correspond to 7T ingress events? In other words, are users generating all of these events directly? This doesn’t seem likely, so how can we account for this? Well, every interaction with LinkedIn probably generates a tidal wave of events – think of every mouseover, click, and update request.
Our first clue is the need to write events to remote clusters using Brooklin: a single ingress event may be counted multiple times. The second clue is in their blog itself: LinkedIn uses 100,000 topics. Let's explore that further.
The Thing About Static Topic Architecture…
100,000 topics sounds like a lot, but over 645 million users, that’s an extremely coarse-grained topic structure. I can only divide a stream over an average of 6,450 users. These topics are also static. Remember, I can’t easily use topics with dynamic data – such as a username, or a short-lived artifact like an update.
It might seem like this would work quite well if the topic acts as an aggregator – say a clickstream – since only one topic is needed no matter how many users there are. But what happens if you are looking for a specific, important event within that clickstream, such as "delete my account"? If I wish to create an application to monitor these special events, I have some choices to make:
I can have my application monitor all events;
I could create a new topic to which my publishers must duplicate their publishing; or,
I could have my existing clickstream applications republish the events of interest, as sketched just below.
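Here is what that third option tends to look like in practice: a minimal Kafka Streams sketch (the topic names and event format are assumptions for illustration) that reads the whole clickstream just to peel off one event type and write it out again to its own topic.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class AccountDeletionRepublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "account-deletion-republisher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> clicks = builder.stream("clickstream");

        // Every event on the firehose is inspected just to find the rare one we care about,
        // and that one is then written a second time to its own topic.
        clicks.filter((user, event) -> event.contains("\"action\":\"delete-account\""))
              .to("account-deletions");

        new KafkaStreams(builder.build(), props).start();
    }
}
```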
Republishing, Republishing, Republishing
Creating a new topic takes engineering work to keep the load balanced. You either have to change publishers to duplicate-publish to the new topic (which breaks decoupling), or have existing processors republish the specific events (which preserves decoupling better but quickly multiplies the event stream). You could use an index key, but you only have one key, which severely restricts the filtering it can help with, so you end up having to republish events multiple times. More republishing!
Monitoring the existing clickstream is clean architecturally, but it means your application is ignoring the majority of the events it receives.
Another problem with a static topic architecture is that publishers change over time, as do consumers. How do you define a static topic structure that accommodates the needs of your current applications and those of applications you haven’t thought of yet? The topics can be generic to allow wide re-use, but the expense of being generic is inefficiency.
What I'm describing here is topic filtering – a function of an implementation of the publish/subscribe pattern that allows subscribers to listen only to events of interest. The coarser a topic, the less likely it is to carry only events of interest, and static topics make filtering even more difficult.
Another function of publish/subscribe is topic routing – the ability to route events to specific remote destinations selectively. This is important for inter-cluster and inter-region communication because by selecting events you can ensure that only events that are needed get moved, which reduces load on brokers, networks, and clients. A static topic structure condemns us to duplication, republishing, or some other form of inefficiency.
You can also exchange subscription information among brokers, in much the same way that IP routing shares network segment information. With the right subscription propagation facility, the brokers can move events only to where they are needed. Load balancing is dynamic, so you can scale consumer counts and move consumers around brokers as needed.
A rich, hierarchical topic structure with wildcard filtering has proven to be easy to use, performant, scalable, and robust. We know of a bank that tested Kafka's static topic structure and found that by moving to a dynamic topic structure its throughput and storage requirements were reduced by over 50x.
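To make the filtering idea concrete, here is a minimal sketch using the Eclipse Paho MQTT client. MQTT is just one protocol with hierarchical topics and wildcards, and the broker address and topic scheme are assumptions rather than anyone's actual layout; the point is that the subscriber receives only account-deletion events, for any user, without anyone having to republish them.

```java
import org.eclipse.paho.client.mqttv3.IMqttDeliveryToken;
import org.eclipse.paho.client.mqttv3.MqttCallback;
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class AccountEventListener {
    public static void main(String[] args) throws MqttException {
        // Broker URL and topic scheme are illustrative assumptions.
        MqttClient client = new MqttClient("tcp://broker.example.com:1883",
                                           MqttClient.generateClientId());
        client.setCallback(new MqttCallback() {
            public void connectionLost(Throwable cause) { }
            public void messageArrived(String topic, MqttMessage message) {
                System.out.println(topic + " -> " + new String(message.getPayload()));
            }
            public void deliveryComplete(IMqttDeliveryToken token) { }
        });
        client.connect();
        // '+' matches exactly one topic level: receive account-deletion events for any user,
        // and nothing else from the clickstream hierarchy.
        client.subscribe("clickstream/+/account-delete", 1);
    }
}
```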
Flat Topic Structures Often Do Not Scale
The conclusion I draw from this is that while the Kafka architecture addresses the need to scale throughput on a broker, it has ignored the need to scale the topic structure, which results in:
Excess client state-keeping requirements, which make client scaling hard.
A multiplier effect on events, as they must be republished or duplicated to avoid overwhelming clients with events they are not interested in receiving.
If the topic structure is to act as an aggregation channel, this trade-off works well. However, as derived events are generated in reaction to the aggregated stream, the trade-off breaks down. Scaling becomes non-linear as the number of events rises. For example, instead of having 2 brokers dealing with 10 events, you have 10 Kafka brokers dealing with 100 events.
A dynamic, rich, hierarchical topic structure allows topic routing and filtering, which avoids the multiplication of events that you get with a flat topic structure. Such a hierarchical, dynamic structure does not, however, come without challenges.
The Challenges of Dynamic Topic Structure
Since topics are dynamic, it takes more up-front work to define a topic structure or schema that defines which fields go where in the topic hierarchy, what data is suitable for inclusion in the topic, what is not suitable, etc.
You also need to consider governance. You can't point at a topic and decide who can and can't access it; you need to use the topic structure to create rules and policies. This allows fine-grained control over who accesses what data, so the extra effort yields more flexibility.
Lastly, the broker has to do more work, such as performing wildcard and hierarchical topic lookups. Carefully constructed, this can be restricted to simple string matching, which can perform the majority of the filtering required, including geo-fencing; a toy version of such a matcher is sketched below. And you must propagate subscription information around the brokers, a non-trivial task.
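A minimal, illustrative matcher follows. MQTT-style wildcards are assumed purely for concreteness, and real brokers add indexes and caching, but the per-level test really is just string comparison.

```java
public final class TopicMatcher {
    // Toy hierarchical matcher: '+' matches exactly one level, '#' (assumed to appear only
    // as the last level, not enforced here) matches that level and everything below it.
    public static boolean matches(String subscription, String topic) {
        String[] sub = subscription.split("/");
        String[] top = topic.split("/");
        for (int i = 0; i < sub.length; i++) {
            if (sub[i].equals("#")) {
                return true;                      // matches this level and everything below
            }
            if (i >= top.length) {
                return false;                     // topic is shorter than the filter
            }
            if (!sub[i].equals("+") && !sub[i].equals(top[i])) {
                return false;                     // literal levels must match exactly
            }
        }
        return sub.length == top.length;          // no trailing, unmatched topic levels
    }

    public static void main(String[] args) {
        System.out.println(matches("clickstream/+/account-delete", "clickstream/user42/account-delete")); // true
        System.out.println(matches("clickstream/#", "clickstream/user42/page-view"));                     // true
        System.out.println(matches("clickstream/+/account-delete", "clickstream/user42/page-view"));      // false
    }
}
```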
Put simply, dynamic, hierarchical topics add complexity and load to the broker and require careful architectural design (tooling for topic design is now emerging). Static, flat topic structures, on the other hand, simplify the broker but cost more in ancillary broker tooling and architectural complexity.
Conclusion: Don’t Use a Hammer on a Screw
Kafka excels at high-rate data aggregation, but there are downsides to applying it to the wrong use cases. For example, microservices aren’t readily served by Kafka due to the client complexity, difficulties in scaling consumers across topic partitions, and state-keeping requirements.
It is tempting to use Kafka to distribute data derived from your aggregate stream, but that frequently amounts to moving disaggregated data, which Kafka isn't good at. A smart broker takes derived data and routes it in a filtered manner to appropriate destinations, eliminating the multiplier effect that static, flat topic structures have on this kind of data. A hierarchical topic structure allows filters on multiple fields (instead of just a single key), which eliminates the event multiplier effect often seen with Kafka as developers struggle to make sure consumers get the data they are interested in – which can lead to a hair-raising number of messages.
Perhaps the question architects and developers should be asking is not “How many messages can I move?” to determine scaling capabilities and requirements, but rather “How many messages should I be moving?”
Talk to Solace: we can help you understand this trade-off and show you how a smart event broker can optimize your architecture.
The post The Scalability Downside of Static Topics; Learning from LinkedIn’s Implementation of Apache Kafka appeared first on Solace.
netmetic · 6 years ago
Text
Replicating the Success of REST in Event-Driven Architecture
This post is a collaboration between Fran Méndez of AsyncAPI and Solace’s Jonathan Schabowsky.
In my last blog post, I explained how the loose coupling of applications associated with event-driven architecture and publish/subscribe messaging is both a strength and a weakness. As part of that, I touched on the fact that request/reply interactions using RESTful APIs are still the dominant application integration paradigm, even in hybrid cloud, machine learning and IoT use cases that benefit from event-driven interactions. There are still tons of use cases for which RESTful request/reply interactions are perfect, but it's important to be able to mix and match the right exchange pattern (Command, Query and Event) for the job, especially where event-driven would be best suited.
In many cases, exploring why one thing has established or maintained popularity can help you understand why something else isn’t quite as hot, even though it seems like it should be. With this post I’ll investigate why the use of RESTful APIs is still so prevalent, and see if the reasons for its persistent popularity might act as a blueprint for making event-driven popular and mainstream. So, how did REST come to be the most popular way to connect applications? And why does everyone think it’s so easy?
How did REST get to be so hot?
REST’s popularity arose out of the need for data exchange and interactions between the web browser and backend services. In that context it became a de facto standard because it integrated so well with JavaScript and was so much easier than SOAP (a decent protocol that became bloated and complicated over time). From there, developers started using REST to connect internal enterprise applications, IoT devices and even microservices. It might not have been the best fit for all those use cases, but it got the job done.
As Matt McLarty mentions in his blog post Overcoming RESTlessness, a complete examination of why REST came to be used in places it isn't ideal for "would ignore the power that comes from REST's universality." He's referring to the fact that REST has become universal because developers "get it" and it's surrounded by a thriving ecosystem of complementary technology and tools. Without this ecosystem, which REST inherited from the web world, that universal adoption simply would not have happened.
The Building Blocks of REST’s Success
If you look closely at this ecosystem (foreshadowing), you can see that it's composed of some foundational components upon which the open source and vendor community have built what I'll call "enablement tooling." Here's what I mean:
Foundational Components
Web servers were the workhorses of the web for years before REST came into existence. They were much simpler than the application servers of the time and optimized to deal with large numbers of lightweight request/reply interactions, like serving up a web page that somebody requests.
Development frameworks like Spring, JAX-RS, Restlet and Node.js reflect the fact that people invested time and energy to make the developer experience easy, i.e. keeping developers from having to write boilerplate connection code so they could focus on the hard part of developing and refining business logic (a flavor of this is sketched after this list).
Security frameworks like OAuth for authentication and authorization, and TLS for encryption, established the means by which interactions and information can be made secure.
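As a small illustration of how little boilerplate those frameworks leave to the developer, here is a hedged Spring Boot sketch; the endpoint and payload are made up for the example. The framework supplies the HTTP listener, routing and JSON serialization, and the application code is just the handler.

```java
import java.util.Map;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
@RestController
public class GreetingService {
    public static void main(String[] args) {
        SpringApplication.run(GreetingService.class, args);
    }

    // No connection handling, threading or serialization code: the developer writes
    // only the business logic for the resource.
    @GetMapping("/greeting")
    public Map<String, String> greeting(@RequestParam(defaultValue = "world") String name) {
        return Map.of("message", "Hello, " + name);
    }
}
```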
Enablement Tooling
API Management: Companies like Apigee and MuleSoft built platforms that provide an API portal so developers can describe and discover APIs in design-time, API gateways to ensure security, management and API mediation, and finally usage analytics which inform which APIs are most and least used. These API management solutions are used increasingly for sophisticated API creation and design, and to act as API marketplaces.
Runtime API Discovery: As APIs and applications have become increasingly dynamic and distributed due to continuous delivery, containerization and cloud-bursting, discovery tooling such as Netflix Eureka and Istio/Envoy (service mesh) has been created to reduce the complexity of API clients and enable them to connect to services anywhere.
Specification for API Description: OpenAPI was created as a machine-readable metadata specification in order to document, design and consume APIs. This is incredibly valuable for use by testing tools, clients and document generation.
Code Generation Tools: Swagger and its associated code generation tooling let developers easily take an OpenAPI definition and generate either client or server code, drastically reducing the amount of work it takes development teams to use APIs.
Without the foundational components, not only would the enablement tooling not have been possible, there wouldn’t have been any need or demand for it. This ecosystem of tools has facilitated REST’s ascension to its position as the de facto standard for application interactions today. While I lament the fact that event-driven hasn’t achieved this same level of adoption and devotion, I understand why, and know that without similar tooling it never will.
How Event-Driven is Following in REST’s Footsteps
There is no reason why the event-driven world can’t learn from the RESTful API world by leveraging and developing similar foundational components and enablement tools. In fact, some very exciting initiatives are underway and picking up steam in the industry and within Solace:
Foundational Components
Event Brokers: This one is easy, as many simple (RabbitMQ, ActiveMQ) and advanced (Solace PubSub+, Kafka) event brokers exist today. Many of them are battle-tested and used widely in organizations that are event-driven.
Development Frameworks: Spring Cloud Stream makes writing event-driven microservices easy (see the sketch after this list), and Paho for MQTT makes it easy to create event-driven IoT sensors in many programming languages.
Security: Frameworks like OAuth enable authentication and authorization in the event-driven world, along with TLS to provide confidentiality and integrity.
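For instance, with Spring Cloud Stream's functional model an event-driven microservice can be little more than the following hedged sketch. The binding name and destination are assumptions, and the broker binder (Solace PubSub+, Kafka, RabbitMQ, etc.) is chosen by dependency and configuration rather than code.

```java
import java.util.function.Consumer;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class AccountDeletionHandler {
    public static void main(String[] args) {
        SpringApplication.run(AccountDeletionHandler.class, args);
    }

    // The broker connection, serialization and subscription are derived from configuration
    // (e.g. spring.cloud.stream.bindings.onDeletion-in-0.destination), so the handler
    // contains no messaging boilerplate at all.
    @Bean
    public Consumer<String> onDeletion() {
        return event -> System.out.println("account deletion event: " + event);
    }
}
```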
Enablement Tooling
Event Management: While advanced event brokers perform many functions similar to those of an API Gateway, no vendor offers a platform that does everything for events that API management platforms do for RESTful API interactions. There are no “event portals” for developers to use, for example, in order to design, document and discover events.
Runtime Event Discovery: In the eventing world, delivering events to consumers is even more complicated than with APIs because of the combination of one-to-many event distribution and guaranteed, in-order qualities of service, with event producers and consumers being just as dynamic and distributed as their API counterparts. This has challenged infrastructure and operations teams for years, and client applications should not be burdened with these complexities. The event mesh is an emerging architectural concept that provides functionality similar to the service mesh, but targeted at asynchronous interaction patterns. It removes the complexities just described by enabling producers and consumers to exchange events regardless of where they are physically deployed, all while maintaining event delivery qualities of service.
API Description Specification: AsyncAPI is on a mission to standardize event-driven API interactions and support the wide variety of messaging systems available. It is the counterpart to OpenAPI – a universal language for all the different messaging protocols and event schema formats. The purpose of AsyncAPI is to enable architects and developers to specify the event payload definition, channel name, application/transport headers and protocol – thus fully specifying the application's event-driven interface. This was previously not available but, thanks to Fran Méndez and the AsyncAPI Initiative, event-driven applications will receive the same love as RESTful APIs.
Code Generation Tools: AsyncAPI is also working in this direction. For instance, the ability to take an AsyncAPI definition and generate event-driven applications is underway for Spring Cloud Stream. This will drastically reduce the effort to create new applications!
Conclusion
EDA’s popularity has started to drastically increase as many companies are realizing they MUST react in real-time to their customers, decouple their systems and transform into event-driven organizations. However, for event-driven interactions to achieve the same level of adoption as REST, the build-out of tooling for eventing must continue. Now is the time to transform and support all the patterns modern applications need for interaction, i.e. commands, queries… and events!
Solace is committed to helping organizations realize the advantages of being event-driven. We're active on all these fronts: continuing to advance the state of the art with our PubSub+ event broker and event mesh, enthusiastically supporting Spring Cloud Stream, and actively contributing expertise and financial support to AsyncAPI. Stay tuned for more information about how event management and API management are similar, why event management is a key capability that organizations need, and what Solace is doing about it!
The post Replicating the Success of REST in Event-Driven Architecture appeared first on Solace.